Comments (23)

mmusich commented on August 23, 2024

assign hlt

from cmssw.

mmusich commented on August 23, 2024

What is the purpose of export MALLOC_CONF=junk:true?

MALLOC_CONF is the environment variable that configures jemalloc, the allocator in use here. With junk:true, jemalloc fills newly allocated memory (and memory being freed) with a fixed junk byte pattern instead of leaving the previous contents in place.
This helps expose reads of uninitialized memory and use-after-free bugs: code that consumes the junk pattern tends to crash or produce obviously wrong results in a reproducible way, rather than appearing to work by accident.
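As a rough illustration of the effect (a toy model in Python, not jemalloc itself): the jemalloc documentation gives 0xa5 as the byte written into fresh allocations when junk filling is enabled. The buffer size, the record layout, and the helper name `junk_alloc` below are made up for the example.

```python
import struct

JUNK_BYTE = 0xA5  # byte jemalloc documents for junk-filling fresh allocations


def junk_alloc(n_bytes):
    """Stand-in for an allocation under junk:true: every byte starts as 0xA5."""
    return bytearray([JUNK_BYTE]) * n_bytes


# Hypothetical buffer with room for 5 uint32 records, of which only 3 get written.
capacity, written = 5, 3
buf = junk_alloc(4 * capacity)
for i in range(written):
    struct.pack_into("<I", buf, 4 * i, i + 1)

# A consumer that trusts `capacity` instead of `written` reads the junk tail.
values = [struct.unpack_from("<I", buf, 4 * i)[0] for i in range(capacity)]
print([hex(v) for v in values])
```

The unwritten slots read back as 0xa5a5a5a5, a value that is easy to recognize in a log or a debugger.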

cmsbuild commented on August 23, 2024

cms-bot internal usage

cmsbuild commented on August 23, 2024

A new Issue was created by @VinInn.

@makortel, @Dr15Jones, @smuzaffar, @antoniovilela, @rappoccio, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

VinInn commented on August 23, 2024

Does not seem to happen in MC Relvals

VinInn commented on August 23, 2024

Running "single-threaded", this is the stack trace:

Thread 12 (Thread 0x7fe7d1fff700 (LWP 3848515) "cmsRun"):
#0  0x00007fe8cd166301 in poll () from /usr/lib64/libc.so.6
#1  0x00007fe8c29de2ff in full_read.constprop () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fe8c2991afc in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  0x00007fe8c2992460 in sig_dostack_then_abort () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#4  <signal handler called>
#5  0x00007fe82e9c0974 in HLTRecHitInAllL1RegionsProducer<EcalRecHit>::produce(edm::Event&, edm::EventSetup const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginRecoEgammaEgammaHLTProducersPlugins.so
#6  0x00007fe8cfbbb47f in edm::stream::EDProducerAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#7  0x00007fe8cfb9fc2c in edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#8  0x00007fe8cfb27f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#9  0x00007fe8cfb284c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#10 0x00007fe8cfcdbf78 in tbb::detail::d1::function_task<edm::WaitingTaskList::announce()::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreConcurrency.so
#11 0x00007fe8ce2ca95b in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::outermost_worker_waiter> (t=0x7fe811635e00, waiter=..., this=0x7fe8caedbe80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#12 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::outermost_worker_waiter> (t=0x0, waiter=..., this=0x7fe8caedbe80) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#13 tbb::detail::r1::arena::process (tls=..., this=<optimized out>) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/arena.cpp:137
#14 tbb::detail::r1::market::process (this=<optimized out>, j=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/market.cpp:599
#15 0x00007fe8ce2ccb0e in tbb::detail::r1::rml::private_worker::run (this=0x7fe8c7f7a100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:271
#16 tbb::detail::r1::rml::private_worker::thread_routine (arg=0x7fe8c7f7a100) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/private_server.cpp:221
#17 0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#18 0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 11 (Thread 0x7fe7d4dff700 (LWP 3848514) "cmsRun"):
#0  0x00007fe8cd417da6 in do_futex_wait.constprop () from /usr/lib64/libpthread.so.0
#1  0x00007fe8cd417e98 in __new_sem_wait_slow.constprop.0 () from /usr/lib64/libpthread.so.0
#2  0x00007fe8adbd8dda in ?? () from /usr/lib64/libcuda.so.1
#3  0x00007fe8adbe8373 in ?? () from /usr/lib64/libcuda.so.1
#4  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 10 (Thread 0x7fe7f9b17700 (LWP 3848298) "cmsRun"):
#0  0x00007fe8cd13b918 in nanosleep () from /usr/lib64/libc.so.6
#1  0x00007fe8cd169178 in usleep () from /usr/lib64/libc.so.6
#2  0x00007fe8c051cdfa in FedRawDataInputSource::readSupervisor() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libEventFilterUtilities.so
#3  0x00007fe8cdaa3a73 in std::execute_native_thread_routine (__p=0x7fe8058ae2e0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#4  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 9 (Thread 0x7fe80ab57700 (LWP 3848262) "cmsRun"):
#0  0x00007fe8cd13b918 in nanosleep () from /usr/lib64/libc.so.6
#1  0x00007fe8cd13b81e in sleep () from /usr/lib64/libc.so.6
#2  0x00007fe8c051239a in evf::FastMonitoringService::snapshotRunner() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libEventFilterUtilities.so
#3  0x00007fe8cdaa3a73 in std::execute_native_thread_routine (__p=0x7fe805aa42e0) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#4  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 8 (Thread 0x7fe8211ff700 (LWP 3848219) "cmsRun"):
#0  0x00007fe8cd41545c in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fe849bf2f9e in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tsl::thread::EigenEnvironment::Task*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_cc.so.2
#2  0x00007fe849bf3563 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_cc.so.2
#3  0x00007fe849bf0c78 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_cc.so.2
#4  0x00007fe83c494e32 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_framework.so.2
#5  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#6  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 7 (Thread 0x7fe82636d700 (LWP 3848218) "cmsRun"):
#0  0x00007fe8cd41545c in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fe849bf2f9e in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tsl::thread::EigenEnvironment::Task*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_cc.so.2
#2  0x00007fe849bf3563 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_cc.so.2
#3  0x00007fe849bf0c78 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_cc.so.2
#4  0x00007fe83c494e32 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_framework.so.2
#5  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#6  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 6 (Thread 0x7fe826b6e700 (LWP 3848217) "cmsRun"):
#0  0x00007fe8cd41545c in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fe849bf2f9e in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WaitForWork(Eigen::EventCount::Waiter*, tsl::thread::EigenEnvironment::Task*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_cc.so.2
#2  0x00007fe849bf3563 in Eigen::ThreadPoolTempl<tsl::thread::EigenEnvironment>::WorkerLoop(int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_cc.so.2
#3  0x00007fe849bf0c78 in std::_Function_handler<void (), tsl::thread::EigenEnvironment::CreateThread(std::function<void ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_cc.so.2
#4  0x00007fe83c494e32 in tsl::(anonymous namespace)::PThread::ThreadFn(void*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/external/el8_amd64_gcc12/lib/scram_x86-64-v3/libtensorflow_framework.so.2
#5  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#6  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 5 (Thread 0x7fe830d65700 (LWP 3848154) "cmsRun"):
#0  0x00007fe8cd41545c in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fe8c0518436 in FedRawDataInputSource::readWorker(unsigned int) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libEventFilterUtilities.so
#2  0x00007fe8cdaa3a73 in std::execute_native_thread_routine (__p=0x7fe8b55bde30) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#3  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#4  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 4 (Thread 0x7fe8861de700 (LWP 3848143) "cuda-EvtHandlr"):
#0  0x00007fe8cd166301 in poll () from /usr/lib64/libc.so.6
#1  0x00007fe8adbed89f in ?? () from /usr/lib64/libcuda.so.1
#2  0x00007fe8adcbbdcf in ?? () from /usr/lib64/libcuda.so.1
#3  0x00007fe8adbe8373 in ?? () from /usr/lib64/libcuda.so.1
#4  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 3 (Thread 0x7fe88de26700 (LWP 3848140) "cuda0000380000f"):
#0  0x00007fe8cd166301 in poll () from /usr/lib64/libc.so.6
#1  0x00007fe8adbed89f in ?? () from /usr/lib64/libcuda.so.1
#2  0x00007fe8adcbbdcf in ?? () from /usr/lib64/libcuda.so.1
#3  0x00007fe8adbe8373 in ?? () from /usr/lib64/libcuda.so.1
#4  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 2 (Thread 0x7fe88e6cc700 (LWP 3848133) "cmsRun"):
#0  0x00007fe8cd419672 in waitpid () from /usr/lib64/libpthread.so.0
#1  0x00007fe8c298eb37 in edm::service::cmssw_stacktrace_fork() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#2  0x00007fe8c2991a2a in edm::service::InitRootHandlers::stacktraceHelperThread() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  0x00007fe8cdaa3a73 in std::execute_native_thread_routine (__p=0x7fe8c2dbb650) at ../../../../../libstdc++-v3/src/c++11/thread.cc:82
#4  0x00007fe8cd40f1ca in start_thread () from /usr/lib64/libpthread.so.0
#5  0x00007fe8cd07be73 in clone () from /usr/lib64/libc.so.6
Thread 1 (Thread 0x7fe8cc599640 (LWP 3848049) "cmsRun"):
#0  0x00007fe8cd13b918 in nanosleep () from /usr/lib64/libc.so.6
#1  0x00007fe8cd13b81e in sleep () from /usr/lib64/libc.so.6
#2  0x00007fe8c298e9e0 in sig_pause_for_stacktrace () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x00007ffc41330bba in clock_gettime ()
#5  0x00007fe8cd13658a in clock_gettime@GLIBC_2.2.5 () from /usr/lib64/libc.so.6
#6  0x00007fe8cf85a962 in edm::WallclockTimer::start() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreUtilities.so
#7  0x00007fe8cfb8104c in edm::SystemTimeKeeper::startModuleEvent(edm::StreamContext const&, edm::ModuleCallingContext const&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#8  0x00007fe8cfbbb017 in edm::stream::EDFilterAdaptorBase::doEvent(edm::EventTransitionInfo const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#9  0x00007fe8cfb9fe6c in edm::WorkerT<edm::stream::EDFilterAdaptorBase>::implDo(edm::EventTransitionInfo const&, edm::ModuleCallingContext const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#10 0x00007fe8cfb27f59 in std::__exception_ptr::exception_ptr edm::Worker::runModuleAfterAsyncPrefetch<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(std::__exception_ptr::exception_ptr, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::TransitionInfoType const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#11 0x00007fe8cfb284c5 in edm::Worker::RunModuleTask<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >::execute() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#12 0x00007fe8cfa98bae in tbb::detail::d1::function_task<edm::WaitingTaskHolder::doneWaiting(std::__exception_ptr::exception_ptr)::{lambda()#1}>::execute(tbb::detail::d1::execution_data&) () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#13 0x00007fe8ce2d3281 in tbb::detail::r1::task_dispatcher::local_wait_for_all<false, tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe8caedbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:322
#14 tbb::detail::r1::task_dispatcher::local_wait_for_all<tbb::detail::r1::external_waiter> (waiter=..., t=<optimized out>, this=0x7fe8caedbe00) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.h:458
#15 tbb::detail::r1::task_dispatcher::execute_and_wait (t=<optimized out>, wait_ctx=..., w_ctx=...) at /data/cmsbld/jenkins/workspace/auto-builds/CMSSW_14_1_0_pre1-el8_amd64_gcc12/build/CMSSW_14_1_0_pre1-build/BUILD/el8_amd64_gcc12/external/tbb/v2021.9.0-c3903c50b52342174dbd3a52854a6e6d/tbb-v2021.9.0/src/tbb/task_dispatcher.cpp:168
#16 0x00007fe8cfaa941b in edm::FinalWaitingTask::wait() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#17 0x00007fe8cfab324d in edm::EventProcessor::processRuns() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#18 0x00007fe8cfab37b1 in edm::EventProcessor::runToCompletion() () from /cvmfs/cms.cern.ch/el8_amd64_gcc12/cms/cmssw/CMSSW_14_0_6_MULTIARCHS/lib/el8_amd64_gcc12/scram_x86-64-v3/libFWCoreFramework.so
#19 0x00000000004074ef in tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const ()
#20 0x00007fe8ce2bf9ad in tbb::detail::r1::task_arena_impl::execute (ta=..., d=warning: RTTI symbol not found for class 'tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>'
#21 0x0000000000408ed2 in main::{lambda()#1}::operator()() const ()
#22 0x000000000040517c in main ()

Current Modules:

Module: HLTEcalRecHitInAllL1RegionsProducer:hltRechitInRegionsECAL (crashed)
Module: HLTPrescaler:hltPreDiSC3018EIsoANDHEMass70

A fatal system signal has occurred: segmentation violation
[innocent@lxplus800]/tmp/innocent/CMSSW_14_0_6_MULTIARCHS/src%

cmsbuild commented on August 23, 2024

New categories assigned: hlt

@Martin-Grunewald,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

mmusich commented on August 23, 2024

reproduced with:

#!/bin/bash -ex

scram p CMSSW CMSSW_14_0_5_patch1
cd CMSSW_14_0_5_patch1/src
eval `scramv1 runtime -sh`

export MALLOC_CONF=junk:true

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 380115 > hlt_run380115.py
cat <<@EOF >> hlt_run380115.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
  buBaseDir = '/eos/cms/store/group/tsg/FOG/error_stream/',
  runNumber = 380115
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
  fileListMode = True,
  fileNames = (
  '/eos/cms/store/group/tsg/FOG/error_stream/run380115/run380115_ls0338_index000079_fu-c2b03-28-01_pid1451372.raw',
  '/eos/cms/store/group/tsg/FOG/error_stream/run380115/run380115_ls0338_index000104_fu-c2b03-28-01_pid1451372.raw'
  )
)
process.options.wantSummary = True

process.options.numberOfThreads = 32
process.options.numberOfStreams = 24
@EOF

mkdir run380115
cmsRun hlt_run380115.py &> crash_run380115.log

@cms-sw/ecal-dpg-l2 please take a look.

mmusich commented on August 23, 2024

Does not seem to happen in MC Relvals

At least it does reproduce on HLT addOnTests on data, see log, preview of cms-sw/cms-bot#2228 output.

thomreis commented on August 23, 2024

What is the purpose of export MALLOC_CONF=junk:true?

fwyzard commented on August 23, 2024

What is the purpose of export MALLOC_CONF=junk:true?

https://github.com/jemalloc/jemalloc/wiki/Use-Case:-Find-a-memory-corruption-bug

mmusich commented on August 23, 2024

Another reproducer (with a more recent release):

#!/bin/bash -ex

cd CMSSW_14_0_6_MULTIARCHS/src
eval `scramv1 runtime -sh`

export MALLOC_CONF=junk:true

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 380466 > hlt_run380466.py
cat <<@EOF >> hlt_run380466.py
from EventFilter.Utilities.EvFDaqDirector_cfi import EvFDaqDirector as _EvFDaqDirector
process.EvFDaqDirector = _EvFDaqDirector.clone(
   buBaseDir = '/eos/cms/store/group/tsg/FOG/error_stream/',
   runNumber = 380466
)
from EventFilter.Utilities.FedRawDataInputSource_cfi import source as _source
process.source = _source.clone(
   fileListMode = True,
   fileNames = (
   '/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000212_fu-c2b03-09-01_pid672001.raw',
   '/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000232_fu-c2b03-09-01_pid672001.raw',
   '/eos/cms/store/group/tsg/FOG/error_stream/run380466/run380466_ls0276_index000246_fu-c2b03-09-01_pid672001.raw'
   )
)

process.options.accelerators = ['cpu']

process.hltOnlineBeamSpotESProducer.timeThreshold = int(1e6)

process.options.wantSummary = True

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
@EOF


directory="run380466"

# Check if the directory exists
if [ -d "$directory" ]; then
    # If it exists, remove it
    rm -rf "$directory"
fi

# Create the directory
mkdir "$directory"

cmsRun hlt_run380466.py &> crash_run380466.log

The crash described in the issue happens here:

auto this_cell = subDetGeom->getGeometry(recHit.id());

adding a check on the existence of the cell

+         if(this_cell==nullptr)
+           continue;

we get past there, but then there is an exception:

----- Begin Fatal Exception 14-May-2024 10:53:58 CEST-----------------------
An exception of category 'PFEcalEndcapRecHitCreator' occurred while
   [0] Processing  Event run: 380466 lumi: 276 event: 490512491 stream: 0
   [1] Running path 'HLT_Diphoton24_16_eta1p5_R9IdL_AND_HET_AND_IsoTCaloIdT_v8'
   [2] Calling method for module PFRecHitProducer/'hltParticleFlowRecHitECALUnseeded'
Exception Message:
detid 2779096485not found in geometry
----- End Fatal Exception -------------------------------------------------

which matches the exception seen in the relval tests of cms-sw/cms-bot#2228 (see the comment there).

The main question I have is if 2779096485 is an existing xtal or just junk.

thomreis commented on August 23, 2024

ECAL detIDs all start with an 8. So junk it is indeed.
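The same conclusion can be reached by looking at the bits. The sketch below assumes the usual DetId packing, with the detector code in the top 4 bits and the subdetector in the next 3, and ECAL encoded as detector 3; that layout is not spelled out in this thread, so treat the constants as assumptions. It also checks the value against four repetitions of 0xa5, the fill byte documented by jemalloc for junk:true.

```python
JUNK_BYTE = 0xA5                     # jemalloc's documented junk byte for new allocations
DET_SHIFT, DET_MASK = 28, 0xF        # assumed DetId layout: detector field
SUBDET_SHIFT, SUBDET_MASK = 25, 0x7  # assumed DetId layout: subdetector field
ECAL = 3                             # assumed detector code for ECAL


def decode(raw_id):
    """Split a raw 32-bit id into (detector, subdetector) under the assumed layout."""
    return (raw_id >> DET_SHIFT) & DET_MASK, (raw_id >> SUBDET_SHIFT) & SUBDET_MASK


def is_junk_word(raw_id):
    """True if the 32-bit value is just the junk byte repeated four times."""
    return raw_id == int.from_bytes(bytes([JUNK_BYTE]) * 4, "little")


# Two ids printed in this thread for valid hits, plus the suspicious one.
for raw_id in (838861323, 872443444, 2779096485):
    det, subdet = decode(raw_id)
    print(f"{raw_id} = {raw_id:#010x}: det={det} subdet={subdet} "
          f"ecal={det == ECAL} junk={is_junk_word(raw_id)}")
```

Under these assumptions the two valid ids decode to detector 3, while 2779096485 is exactly 0xa5a5a5a5 and does not carry the ECAL detector code.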

mmusich commented on August 23, 2024

ECAL detIDs all start with an 8. So junk it is indeed.

adding:

diff --git a/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
index 56a9292da36..ff352a772f8 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
@@ -261,6 +261,18 @@ void EcalRecHitProducer::produce(edm::Event& evt, const edm::EventSetup& es) {
   LogInfo("EcalRecHitInfo") << "total # EB calibrated rechits: " << ebRecHits->size();
   LogInfo("EcalRecHitInfo") << "total # EE calibrated rechits: " << eeRecHits->size();
 
+  // Loop over EBRecHitCollection
+  for (const auto& ebRecHit : *ebRecHits) {
+    DetId detId = ebRecHit.detid(); // Get the DetId
+    std::cout << "EB DetId: " << detId.rawId() << std::endl; // Print the rawId of the DetId
+  }
+
+  // Loop over EERecHitCollection
+  for (const auto& eeRecHit : *eeRecHits) {
+    DetId detId = eeRecHit.detid(); // Get the DetId
+    std::cout << "EE DetId: " << detId.rawId() << std::endl; // Print the rawId of the DetId
+  }
+
   evt.put(ebRecHitToken_, std::move(ebRecHits));
   evt.put(eeRecHitToken_, std::move(eeRecHits));
 }

I get:

EB DetId: 838861323

[...]

EE DetId: 872443444
EE DetId: 872443548
EE DetId: 872443698
EE DetId: 872444201
EE DetId: 872444471
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485
EE DetId: 2779096485

so it looks like EcalRecHitProducer is filling the end of EE rechit collection with junk.
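For longer logs, a small script can report where the junk starts instead of reading the printout by eye. This is a minimal sketch that assumes the log lines have exactly the "EE DetId: N" form produced by the debug printout above; the function name `junk_tail` and the sample text are only for illustration.

```python
import re

JUNK_ID = 0xA5A5A5A5  # 2779096485, the value repeated at the end of the printout


def junk_tail(log_text, prefix="EE DetId:"):
    """Return (number of ids, index of first junk id, whether everything after it is junk)."""
    pattern = rf"^{re.escape(prefix)}\s*(\d+)\s*$"
    ids = [int(m.group(1)) for m in re.finditer(pattern, log_text, re.M)]
    first_junk = next((i for i, v in enumerate(ids) if v == JUNK_ID), len(ids))
    tail_is_all_junk = all(v == JUNK_ID for v in ids[first_junk:])
    return len(ids), first_junk, tail_is_all_junk


# Short excerpt in the same format as the printout above.
sample = """EE DetId: 872443698
EE DetId: 872444201
EE DetId: 872444471
EE DetId: 2779096485
EE DetId: 2779096485
"""
total, first_junk, tail_is_all_junk = junk_tail(sample)
print(total, first_junk, tail_is_all_junk)
```

If the junk entries form one contiguous block at the end, that points to a collection sized larger than the number of entries actually filled, rather than to corruption of individual hits.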

VinInn commented on August 23, 2024

It is most probably leaving memory uninitialized, or copying from uninitialized memory, since it is jemalloc that fills it with "junk".
I do not understand why valgrind is not reporting it...

thomreis commented on August 23, 2024

I need to check but I suspect that these extra junk RecHits are already in the input collection of the EcalRecHitProducer. The question is if there are extra digis already or if the junk gets added in the multifit or the EcalUncalibRecHitSoAToLegacy.

mmusich commented on August 23, 2024

I need to check but I suspect that these extra junk RecHits are already in the input collection of the EcalRecHitProducer.

They are:

diff --git a/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
index 56a9292da36..b7e5fd4ed66 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/EcalRecHitProducer.cc
@@ -146,11 +146,31 @@ void EcalRecHitProducer::produce(edm::Event& evt, const edm::EventSetup& es) {
     const auto& eeUncalibRecHits = evt.get(eeUncalibRecHitToken_);
     LogDebug("EcalRecHitDebug") << "total # EE uncalibrated rechits: " << eeUncalibRecHits.size();
 
+    // Loop over uncalib EERecHitCollection
+    for (const auto& eeRecHit : eeUncalibRecHits) {
+      DetId detId = eeRecHit.id(); // Get the DetId
+
+      // Check if the rawId corresponds to 2779096485
+      if (detId.rawId() == 2779096485) {
+        std::cout << "EE Uncalib -DetId: " << detId.rawId() << " - Line: " << __LINE__ << std::endl;
+      }
+    }
+
     // loop over uncalibrated rechits to make calibrated ones
     for (const auto& uncalibRecHit : eeUncalibRecHits) {
       worker_->run(evt, uncalibRecHit, *eeRecHits);
     }

yields:

EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155
EE Uncalib -DetId: 2779096485 - Line: 155

VinInn commented on August 23, 2024

This is NOT enough to crash

diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitProducerPortable.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitProducerPortable.cc
index be36fad7b19..c5cb1c169f4 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitProducerPortable.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalUncalibRecHitProducerPortable.cc
@@ -191,6 +191,13 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
     // output device collections
     OutputProduct uncalibRecHitsDevEB{ebDigisSize, queue};
     OutputProduct uncalibRecHitsDevEE{eeDigisSize, queue};
+    {
+      auto eb = uncalibRecHitsDevEB.buffer(); alpaka::memset(queue, eb, 0xa5);
+      auto ee = uncalibRecHitsDevEE.buffer(); alpaka::memset(queue, ee, 0xa5);
+      alpaka::wait(queue);
+    }

thomreis commented on August 23, 2024

Is alpaka::wait(queue); required after memset? There are two places where memset is used without the wait at the moment.
Adding it does not fix the crashes so just a general question.

fwyzard commented on August 23, 2024

If the copy uses a host queue the wait is not needed, because the host queues are blocking by default. But it also does not harm, as it shouldn't do anything.

For a host-to-device copy using a device queue, it's needed before the data can be accessed on the device using a different stream.

For a device-to-host copy using device queue, it's needed before the data can be accessed on the host.

That said, we have seen that for small memory copies, the GPU runtime seems to work OK also without the wait...
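The ordering rule can be pictured with a toy analogy in plain Python. This is not alpaka code: the class `AsyncQueue` and the function `slow_memset` are hypothetical, and a single-worker thread pool stands in for an in-order, non-blocking device queue. Enqueueing returns immediately, so the host may only rely on the buffer contents after waiting.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class AsyncQueue:
    """Toy in-order, non-blocking queue: enqueue() returns at once."""

    def __init__(self):
        # A single worker keeps the enqueued operations in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._last = None

    def enqueue(self, fn, *args):
        self._last = self._pool.submit(fn, *args)

    def wait(self):
        # With one in-order worker, the last task finishing implies all earlier ones did.
        if self._last is not None:
            self._last.result()

    def close(self):
        self._pool.shutdown(wait=True)


def slow_memset(buf, value):
    time.sleep(0.05)  # pretend the operation takes a while on the device
    buf[:] = bytes([value]) * len(buf)


queue = AsyncQueue()
host_buf = bytearray(8)
queue.enqueue(slow_memset, host_buf, 0xA5)  # returns before the fill has happened
queue.wait()                                # only now is host_buf safe to read
print(host_buf.hex())
queue.close()
```

A blocking queue would correspond to running `slow_memset` inline inside `enqueue`, in which case the later `wait()` has nothing left to do.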

makortel commented on August 23, 2024

For a host-to-device copy using a device queue, it's needed before the data can be accessed on the device using a different stream.

Just to clarify: if the "different queue" arises because the second device-side access is in a different EDModule, the framework adds the necessary synchronization, so an explicit alpaka::wait() is not needed. (This is hopefully a much more common case than EDModules using multiple queues explicitly.)

mmusich commented on August 23, 2024

@thomreis @cms-sw/ecal-dpg-l2 please clarify if there is progress on this issue and if there is someone actively working on solving it.
Thank you
