Coder Social home page Coder Social logo

Comments (7)

willamowius avatar willamowius commented on June 23, 2024

I can reproduce the issue, but I have no idea what the cause is. I'm suspecting PTlib as well.
I have seen similar issues on *BSD, but it is strange that Ubuntu is fine and Kubuntu isn't.

The easiest workaround is to _exit(0) and not close the listeners (or use a distro that doesn't have the issue).

Please keep us updated if you find the cause!

from ptlib.

Lastique avatar Lastique commented on June 23, 2024

I believe this is caused by a change in behavior of pthread_kill in glibc. The following test case demonstrates the difference in behavior between Kubuntu 21.10 (glibc 2.34) and Debian 11 (glibc 2.31).

#include <stdio.h>
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

void* thread_func(void* p)
{
    puts("thread is running");
    fflush(stdout);
    sleep(3);
    puts("thread is terminated");
    fflush(stdout);
}

int main()
{
    pthread_t th;

    puts("launching thread");
    fflush(stdout);

    pthread_create(&th, 0, &thread_func, 0);

    for (unsigned int i = 0; i < 5; ++i)
    {
        sleep(1);
        int res = pthread_kill(th, 0);
        printf("pthread_kill returns %d\n", res);
        fflush(stdout);
    }

    pthread_join(th, 0);
}

After the secondary thread terminates, pthread_kill continues to return 0 on glibc 2.34 and returns 3 on 2.31. This affects PThread::IsTerminated, which never returns true on glibc 2.34, and therefore PThread::WaitForTermination doesn't terminate until timeout.

POSIX specification of pthread_kill does not require the function to return an error if the thread does not exist. In fact, the Rationale explicitly notes that implementations are inconsistent wrt. this behavior, and some implementations return success in this case. IMO, PTLib is incorrect in using pthread_kill to test if the thread is terminated. It was (un)lucky that glibc previously allowed for this to work.

Further, I noted that PThread::m_threadIdValid is never set to false in my test case. The secondary thread is cancelled with pthread_cancel and has no chance of indicating that it has stopped running. Looks like the flag is supposed to be set by PHouseKeepingThread::Main, but that thread is never running in my test case. Besides, the accesses to PThread::m_threadIdValid are not properly synchronized between threads, so we have a data race as well. (And even if PThread::m_threadIdValid is a single byte, this is still a data race as the compiler is free to assume the variable is not modified by other threads and cache its value across multiple iterations of the loop in PThread::WaitForTermination instead of reloading it from memory. Which, I think, is what also happens, though I may be misreading the disassembler.)

Overall, I find threading implementation in PTLib overly complicated and... dodgy. Using pthread_cancel in PThread::Terminate means the thread does not run destructors of its automatic variables, which means it may render the program completely broken (e.g. if some libc or libstdc++ global resource ends up being locked as a result), not to mention potential resource leaks. The story with PHouseKeepingThread is also not clear to me. Why is there a dedicated thread for joining threads instead of using pthread_join/pthread_timed_join or a mechanism based on condition variable in PThread::WaitForTermination? Finally, why PHouseKeepingThread is not running at all?

Anyway, to deal with the immediate problem of PThread::IsTerminated not working we need to remove the use of pthread_kill and replace it with something that works with pthread_cancel and normal thread termination. I'm thinking of using pthread TLS destructors for this (see pthread_key_create), but I haven't tested this idea yet and don't know if PTLib already has something like this in place. This will also be a POSIX-specific solution.

from ptlib.

willamowius avatar willamowius commented on June 23, 2024

I ran a few tests with master and callgen323 on Ubuntu 21.10 and its crashing reproducibly in PThread::PX_ThreadStart().
Haven't had time to test without the patch, yet, but something is fishy on the new Glibc.

from ptlib.

folarte avatar folarte commented on June 23, 2024

My experiments are not too good, but I'll summarize in case it helps more knowledgeable people:

Background: I'm running an IVR based on h323plus. It's been running somehow reliably for years, millions of call per month. as part of ongoing evolution we are rebuilding new machines and tried to change from amazon linux 2 to ubuntu. First tests where done on 20.04 LTS, which runs well, but I decided to use 21.10 ( with release upgrades pointed to LTS ) as I thought it will be an easier upgrade to 22.04 when it comes out. As soon as I did it I encountered problems with make 4.3, which I solved backporting the patches in master, and hanging threads, which I tried to solve with this commit.

Without the patch my program reliably hit the bug, and was asserting on call hangups.

As soon as I applied the patch I reliably got SIGSEGV when trying to answer incomming call. Inside h323plus, at channels.cxx(110), which is just a virtual function call. I suspected memory corruption.

I tried it under GDB, and it normally got SIGSEGV, but a few times it worked OK. When It faulted I observed H323connectionThread / H323Channel seemed corrupt, but by limited gdb skills did not allow me to progress further.

I tried valgrind, which has helped me in the past. And then it works, but I got systematic reports "Conditional jump or move depends on uninitialised value(s)", which was strange. They all were pointing to accesses of h323connection.m_maintanedConnection, either via the accesor function IsMaintanedConnection when using it in the H323SignalPDU::BuildXXXX or in a direct access in the destructor. I checked it is always initialized in the constructor, and being paranoid I even put an =false in the definition, but they still surface, which leads me to think the connection reference passed in is invalid.

I now suspect the addition of m_isrunning has altered the mem layout an what previously was a harmless borked pointer is corrupting the channel or connection pointers inside h323connectionThread or h323Channel. I am trying o locate it, I will write anything I find.

In case it helps, this are the traces of one of the valgrind runs, although I suspect they are not guilty I am trying to use them to try to find the corrupted data.

==166740== Thread 13:
==166740== Conditional jump or move depends on uninitialised value(s)
==166740== at 0x4E06EB0: H323SignalPDU::BuildCallProceeding(H323Connection const&) (h323pdu.cxx:781)
==166740== by 0x4DD12D6: H323Connection::OnReceivedSignalSetup(H323SignalPDU const&) (h323.cxx:1507)
==166740== by 0x1EE265: PTLeg::OnReceivedSignalSetup(H323SignalPDU const&) (leg.cxx:352)
==166740== by 0x4DCF496: H323Connection::HandleSignalPDU(H323SignalPDU&) (h323.cxx:1137)
==166740== by 0x4E39AA0: H323Transport::HandleFirstSignallingChannelPDU(PThread*) (transports.cxx:1150)
==166740== by 0x4E35591: H225TransportThread::Main() (transports.cxx:152)
==166740== by 0x544B314: PThread::PX_ThreadStart(void*) (tlibthrd.cxx:526)
==166740== by 0x592B926: start_thread (pthread_create.c:435)
==166740== by 0x59BB9E3: clone (clone.S:100)
==166740==
==166740== Conditional jump or move depends on uninitialised value(s)
==166740== at 0x4E072F4: H323SignalPDU::BuildAlerting(H323Connection const&) (h323pdu.cxx:865)
==166740== by 0x4DD13B5: H323Connection::OnReceivedSignalSetup(H323SignalPDU const&) (h323.cxx:1527)
==166740== by 0x1EE265: PTLeg::OnReceivedSignalSetup(H323SignalPDU const&) (leg.cxx:352)
==166740== by 0x4DCF496: H323Connection::HandleSignalPDU(H323SignalPDU&) (h323.cxx:1137)
==166740== by 0x4E39AA0: H323Transport::HandleFirstSignallingChannelPDU(PThread*) (transports.cxx:1150)
==166740== by 0x4E35591: H225TransportThread::Main() (transports.cxx:152)
==166740== by 0x544B314: PThread::PX_ThreadStart(void*) (tlibthrd.cxx:526)
==166740== by 0x592B926: start_thread (pthread_create.c:435)
==166740== by 0x59BB9E3: clone (clone.S:100)
==166740==
==166740== Conditional jump or move depends on uninitialised value(s)
==166740== at 0x4E07042: H323SignalPDU::BuildConnect(H323Connection const&) (h323pdu.cxx:814)
==166740== by 0x4DD1979: H323Connection::OnReceivedSignalSetup(H323SignalPDU const&) (h323.cxx:1606)
==166740== by 0x1EE265: PTLeg::OnReceivedSignalSetup(H323SignalPDU const&) (leg.cxx:352)
==166740== by 0x4DCF496: H323Connection::HandleSignalPDU(H323SignalPDU&) (h323.cxx:1137)
==166740== by 0x4E39AA0: H323Transport::HandleFirstSignallingChannelPDU(PThread*) (transports.cxx:1150)
==166740== by 0x4E35591: H225TransportThread::Main() (transports.cxx:152)
==166740== by 0x544B314: PThread::PX_ThreadStart(void*) (tlibthrd.cxx:526)
==166740== by 0x592B926: start_thread (pthread_create.c:435)
==166740== by 0x59BB9E3: clone (clone.S:100)
==166740==
==166740== Thread 9:
==166740== Conditional jump or move depends on uninitialised value(s)
==166740== at 0x4DCD5D9: H323Connection::~H323Connection() (h323.cxx:674)
==166740== by 0x1ED4A7: PTLeg::~PTLeg() (leg.cxx:107)
==166740== by 0x174E69: PTLegAbstract::~PTLegAbstract() (leg.h:176)
==166740== by 0x1EE5A1: PTLegWithEventReceiver::~PTLegWithEventReceiver() (leg.cxx:398)
==166740== by 0x1EE5E5: PTLegWithEventReceiver::~PTLegWithEventReceiver() (leg.cxx:398)
==166740== by 0x4DF4189: H323EndPoint::CleanUpConnections() (h323ep.cxx:2744)
==166740== by 0x4DEB5F1: H323ConnectionsCleaner::Main() (h323ep.cxx:665)
==166740== by 0x544B314: PThread::PX_ThreadStart(void*) (tlibthrd.cxx:526)
==166740== by 0x592B926: start_thread (pthread_create.c:435)
==166740== by 0x59BB9E3: clone (clone.S:100)
==166740==

PTLEG are our connection derived classes, I suspect the problem comes from a bad pointer stored somewhere, which calls ptleg/h323connection with a bad this.

from ptlib.

Lastique avatar Lastique commented on June 23, 2024

I now suspect the addition of m_isrunning has altered the mem layout

m_isRunning is indeed a new field and could potentially affect binary layout. I tried to place it so that it fits in the internal padding, though I haven't checked it actually did. In any case, you should rebuild everything from scratch when you apply the patch, and make sure P_PTHREADS is defined (it should, by auto-generated ptbuildopts.h, which is included by ptlib.h, so you need to ensure you're including it in your code).

Also, see my comments in #11.

from ptlib.

folarte avatar folarte commented on June 23, 2024

I rebuild everything on tests, not always, but periodically I star with deleting, unpacking, patching, configuring, recompiling.

This last info was just a make both, which recompiled everything but as it seems crude I will repeat it starting with unpacking and manual patching.

One data point. Trying to gain some insight into the problem I tried a padding trick which has helped me in the past. I added a dummy padding to PThread.

protected:
bool m_isProcess;
bool m_autoDelete; // Automatically delete the thread on completion.
#if defined(P_PTHREADS)
std::atomic_bool m_isRunning;
#endif
PINDEX m_originalStackSize;

PString m_threadName; // Give the thread a name for debugging purposes.                                       
PMutex  m_threadNameMutex;

PThreadIdentifier m_threadId;
PBoolean m_threadIdValid;

// Try to move alignment
int padding[1000]={0};

And to my surprise, after rebuilding, it exploded in pthread_create, I'm still trying to debug it but so far, if I'm right in padding being value-initialized to 0 ( next test is to fill it ), someone is corrupting my memory:
( padding seems to be severely damaged, but my assumptions may be wrong, I'll check a bit more. )

2022-02-21 11:38:04.743 2adca INFO promptron.start : OK, redirecting log.
[New Thread 0x7ffff406d640 (LWP 175568)]
[New Thread 0x7fffeffff640 (LWP 175569)]
[New Thread 0x7fffef83f640 (LWP 175570)]
[New Thread 0x7fffef03e640 (LWP 175571)]
[New Thread 0x7fffee83d640 (LWP 175572)]
[New Thread 0x7fffee03c640 (LWP 175573)]
[New Thread 0x7fffed83b640 (LWP 175574)]
malloc(): invalid size (unsorted)

Thread 1 "promptronv3" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737287769600) at pthread_kill.c:44
44 pthread_kill.c: No such file or directory.
(gdb) bt
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737287769600) at pthread_kill.c:44
#1 __pthread_kill_internal (signo=6, threadid=140737287769600) at pthread_kill.c:80
#2 __GI___pthread_kill (threadid=140737287769600, signo=signo@entry=6) at pthread_kill.c:91
#3 0x00007ffff6da1476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4 0x00007ffff6d877b7 in __GI_abort () at abort.c:79
#5 0x00007ffff6de85e6 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff6f3a13d "%s\n")
at ../sysdeps/posix/libc_fatal.c:155
#6 0x00007ffff6dffadc in malloc_printerr (str=str@entry=0x7ffff6f3d1d0 "malloc(): invalid size (unsorted)")
at malloc.c:5543
#7 0x00007ffff6e0297c in _int_malloc (av=av@entry=0x7ffff6f73c60 <main_arena>, bytes=bytes@entry=352)
at malloc.c:3889
#8 0x00007ffff6e04cfe in __libc_calloc (n=, elem_size=) at malloc.c:3566
#9 0x00007ffff7fdccac in calloc (b=16, a=22) at ../include/rtld-malloc.h:44
#10 allocate_dtv (result=0x7fffed7fa640) at ../elf/dl-tls.c:375
#11 __GI__dl_allocate_tls (mem=mem@entry=0x7fffed7fa640) at ../elf/dl-tls.c:623
#12 0x00007ffff6df45da in allocate_stack (stacksize=, stack=,
pdp=, attr=0x7fffffffd810) at allocatestack.c:431
#13 __pthread_create_2_1 (newthread=0x5555559df0e8, attr=0x7fffffffd810,
start_routine=0x7ffff74fa17e PThread::PX_ThreadStart(void*), arg=0x5555559df080) at pthread_create.c:623
#14 0x00007ffff74fa5fb in PThread::Restart (this=0x5555559df080) at ptlib/unix/tlibthrd.cxx:590
#15 0x00007ffff74faa49 in PThread::Suspend (this=0x5555559df080, susp=false) at ptlib/unix/tlibthrd.cxx:650
#16 0x00007ffff74fac52 in PThread::Resume (this=0x5555559df080) at ptlib/unix/tlibthrd.cxx:695
#17 0x00007ffff7d947b0 in H323EndPoint::StartListener (this=0x5555559dd1b0, listener=0x5555559df080)
at h323ep.cxx:1807
#18 0x00007ffff7d943d3 in H323EndPoint::StartListener (this=0x5555559dd1b0, iface=...) at h323ep.cxx:1774
#19 0x00007ffff7d9427f in H323EndPoint::StartListeners (this=0x5555559dd1b0, ifaces=...) at h323ep.cxx:1758
#20 0x00005555555e9406 in PTEndPoint::Initialise (this=0x5555559dd1b0, epsect=...) at endpoint.cxx:183
#21 0x00005555555ea823 in endpoints::new_ep_from_ini (sectname="PRIRL_A_00", epsect=...) at endpoint.cxx:335
#22 0x00005555555eacc9 in endpoints::reload_all () at endpoint.cxx:381
#23 0x00005555555eb00a in endpoint_init (cfg=...) at endpoint.cxx:424
#24 0x00005555555a9a4c in PrompTron::OnStart (this=0x555555820800) at main.cxx:93
#25 0x00007ffff74cb622 in PServiceProcess::InternalMain (this=0x555555820800) at ptlib/unix/svcproc.cxx:455
#26 0x00005555555a929c in main (argc=4, argv=0x7fffffffe0c8, envp=0x7fffffffe0f0) at main.cxx:29
(gdb) frame 14
#14 0x00007ffff74fa5fb in PThread::Restart (this=0x5555559df080) at ptlib/unix/tlibthrd.cxx:590
590 PAssertPTHREAD(pthread_create, (&m_threadId, &threadAttr, PX_ThreadStart, this));
(gdb) print *this
$1 = { = {_vptr.PObject = 0x7ffff7f9dbb0 <vtable for H323ListenerTCP+16>}, m_isProcess = false,
m_autoDelete = false, m_isRunning = {_M_base = {static _S_alignment = 1, _M_i = false},
static is_always_lock_free = true}, m_originalStackSize = 30000,
m_threadName = { = {<PBaseArray> = { = { = { = {
_vptr.PObject = 0x55555570f260 <vtable for PString+16>}, reference = 0x5555559ddfc8},
elementSize = 1, theArray = 0x5555557ef2d0 "H323Listener:%0x",
allocatedDynamically = true}, }, }, },
m_threadNameMutex = { = { = {
_vptr.PObject = 0x7ffff760d120 <vtable for PTimedMutex+16>}, }, m_mutex = {__data = {
__lock = 0, __count = 0, __owner = 0, __nusers = 0, __kind = 1, __spins = 0, __elision = 0, __list = {
__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\001", '\000' <repeats 22 times>,
__align = 0}}, m_threadId = 0, m_threadIdValid = false, padding = {0 <repeats 33 times>, 1436406192,
21845, 0, 0, 1433463776, 21845, 1433464248, 21845, 0, 0, 1433464288, 21845, 24, 0 <repeats 11 times>,
1433464656, 21845, 2147483647, 0, 1433464656, 21845, 2147483647, 0, 0, 0, 1433465440, 21845, 1436409848,
21845, 1, 0, 1434365768, 21845, 1, 0, -144649952, 32767, 0, 0, 0, 0, 1, 0 <repeats 11 times>, -144649952,
32767, 0, 0, 0, 0, 1, 0 <repeats 11 times>, -144649952, 32767, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, -144649952,
32767, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, -144649952, 32767, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 17200, 0, 1433466400,
21845, 16777343, 4, 1433464328, 21845, 6, 0, 0, 0, 4098, 0 <repeats 41 times>, 8, 0, 1436414880, 21845,
-148271840, 32767, 0, 0, 0, 0, 1436351520, 21845, -148273344, 32767, -148273456, 32767, -148273440, 32767,
1433466400, 21845, 822083850, 4, 0 <repeats 775 times>}, traceInfo = {
traceStreams = { = { = { = { = {
_vptr.PObject = 0x7ffff760d410 <vtable for PStack+16>},
reference = 0x5555559ddf98}, }, info = 0x5555559ddfe0}, },
traceLevel = 0, traceBlockIndentLevel = 0}, PX_priority = PThread::NormalPriority, PX_linuxId = 0,
PX_startTick = { = {_vptr.PObject = 0x55555570ef50 <vtable for PTimeInterval+16>},
m_milliseconds = 0}, PX_endTick = { = {
_vptr.PObject = 0x55555570ef50 <vtable for PTimeInterval+16>}, m_milliseconds = 0}, PX_suspendMutex = {
__data = {__lock = 1, __count = 0, __owner = 175562, __nusers = 1, __kind = 0, __spins = 0, __elision = 0,
__list = {__prev = 0x0, __next = 0x0}},
__size = "\001\000\000\000\000\000\000\000ʭ\002\000\001", '\000' <repeats 26 times>, __align = 1},
PX_suspendCount = 0, PX_firstTimeStart = false, unblockPipe = {22, 23}}

from ptlib.

folarte avatar folarte commented on June 23, 2024

After testing with current master of h323plus and ptlib all my problems have vanished. No hangs, no cores on thread starts, no asserts due to failed terminations.

from ptlib.

Related Issues (8)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.