Coder Social home page Coder Social logo

Valgrind reports plenty of "Uninitialised value" created by a stack allocation in an Alpaka-kernel in pluginRecoLocalCaloEcalRecProducersPluginsPortableSerialSync.so about cmssw HOT 25 OPEN

VinInn avatar VinInn commented on August 23, 2024
Valgrind reports plenty of "Uninitialised value" created by a stack allocation in an Alpaka-kernel in pluginRecoLocalCaloEcalRecProducersPluginsPortableSerialSync.so

from cmssw.

Comments (25)

mmusich avatar mmusich commented on August 23, 2024 1

assign reconstruction, heterogeneous

from cmssw.

fwyzard avatar fwyzard commented on August 23, 2024 1

hi @makortel you are right, it's a regular data member, not a static one (sorry I mixed things up)

from cmssw.

cmsbuild avatar cmsbuild commented on August 23, 2024

cms-bot internal usage

from cmssw.

cmsbuild avatar cmsbuild commented on August 23, 2024

A new Issue was created by @VinInn.

@smuzaffar, @antoniovilela, @rappoccio, @Dr15Jones, @makortel, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

from cmssw.

cmsbuild avatar cmsbuild commented on August 23, 2024

New categories assigned: reconstruction,heterogeneous

@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

from cmssw.

makortel avatar makortel commented on August 23, 2024

@cms-sw/ecal-dpg-l2 @thomreis

from cmssw.

fwyzard avatar fwyzard commented on August 23, 2024

Looking at the code, the only allocation I see inside the Kernel_time_computation_init kernel is the shared memory (which on the serial CPU back-end is just plain memory).

@VinInn, as a simple test, could you try adding something like

for (auto i: cms::alpakatools::uniform_elements(acc, 2*elemsPerBlock)) {
  shrSampleValues[i] = 0;
}
alpaka::syncBlockThreads(acc);

around line 804, and check if that makes valgrind happy ?

@thomreis, could you check if that has any impact on the results ?

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

I can try but valgrind is saying
created by a stack allocation
not
heap

from cmssw.

fwyzard avatar fwyzard commented on August 23, 2024

Alpaka allocates "shared memory" for the CPU backend as a static data member of a class templated on the kernel type... so that may count as a stack allocation ?

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

most probably yes...
one may check @ TaskKernelCpuSerial.hpp:51

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

zeroing shrSampleValues does not help
WAIT, maybe not recompiled....

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

I confirm that the valgrind issue is still there.
I set the memory to "junk"

      memset(shrSampleValues,0xa5,2*elemsPerBlock*sizeof(ScalarType));
      alpaka::syncBlockThreads(acc);

at least it does not crash....

from cmssw.

makortel avatar makortel commented on August 23, 2024

Alpaka allocates "shared memory" for the CPU backend as a static data member of a class templated on the kernel type

@fwyzard Could you give more details? I'm now wondering how a static data member would work with running multiple instances of the kernel concurrently (naively I'd assume a data race). Poking around I found
https://github.com/alpaka-group/alpaka/blob/1.1.0/include/alpaka/block/shared/dyn/BlockSharedMemDynMember.hpp#L83
as the memory block for both static and dynamic shared memory (that is not class-static), but I could have easily missed something.

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

I actually overlooked all these other ones

=865737==
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA252001: UnknownInlinedFun (MultifitComputations.h:488)
==865737==    by 0xBA252001: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA25200A: UnknownInlinedFun (MultifitComputations.h:490)
==865737==    by 0xBA25200A: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA2522B0: UnknownInlinedFun (MultifitComputations.h:428)
==865737==    by 0xBA2522B0: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA2522EC: UnknownInlinedFun (MultifitComputations.h:435)
==865737==    by 0xBA2522EC: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:226)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA252EDA: UnknownInlinedFun (AmplitudeComputationKernels.dev.cc:243)
==865737==    by 0xBA252EDA: UnknownInlinedFun (invoke.h:61)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA264B7F: UnknownInlinedFun (TimeComputationKernels.h:1030)
==865737==    by 0xBA264B7F: UnknownInlinedFun (invoke.h:61)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xBA264C5B: UnknownInlinedFun (TimeComputationKernels.h:1071)
==865737==    by 0xBA264C5B: UnknownInlinedFun (invoke.h:61)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--
==865737== Conditional jump or move depends on uninitialised value(s)
==865737==    at 0xA2E1F648: UnknownInlinedFun (EcalRecHit.h:128)
==865737==    by 0xA2E1F648: EcalRecHitSimpleAlgo::makeRecHit(EcalUncalibratedRecHit const&, float const&, float const&, unsigned int const&) const (EcalRecHitSimpleAlgo.h:46)
--
==865737==  Uninitialised value was created by a heap allocation
==865737==    at 0x48455F7: operator new(unsigned long, std::align_val_t) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02836/el9_amd64_gcc12/external/valgrind/3.22.0-390bf50f6ee4c321d331c491bff126fd/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==865737==    by 0x70CE9C8D: UnknownInlinedFun (AlignedAlloc.hpp:16)
--

who may be also related to #44956

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

looking into the detailed traceback I see this

alpaka_cuda_async::EcalMultifitConditionsHostESProducer::produce(EcalMultifitConditionsRcd const&) (EcalMultifitConditionsHostESProducer.cc:74)

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

it seems that adding

diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
index ddba855853c..96a7792e9cb 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
@@ -72,6 +72,10 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
       size_t numberOfXtals = pedestalsData.size();

       auto product = std::make_unique<EcalMultifitConditionsHost>(numberOfXtals, cms::alpakatools::host());
+      {
+         alpaka::QueueCpuBlocking queue{cms::alpakatools::host()};
+         auto buff = product->buffer(); alpaka::memset(queue,buff, 0xa5);
+      }
       auto view = product->view();

       // Filling pedestals

it does not crash and all messages from Valgrind mentioned above disappear.
Of course this is not necessarily a good news...

from cmssw.

fwyzard avatar fwyzard commented on August 23, 2024

Reading the code of the ESProducer, the only field that seems to not be filled is the rawid.

@VinInn could you check if adding

diff --git a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
index ddba855853c..d0ee97230ec 100644
--- a/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
+++ b/RecoLocalCalo/EcalRecProducers/plugins/alpaka/EcalMultifitConditionsHostESProducer.cc
@@ -91,6 +91,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
 
       for (unsigned int i = 0; i < barrelSize; ++i) {
         auto vi = view[i];
+        vi.rawid() = 0;
 
         vi.pedestals_mean_x12() = pedestalsEB[i].mean_x12;
         vi.pedestals_rms_x12() = pedestalsEB[i].rms_x12;
@@ -113,6 +114,7 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE {
       }  // end Barrel loop
       for (unsigned int i = 0; i < endcapSize; ++i) {
         auto vi = view[barrelSize + i];
+        vi.rawid() = 0;
 
         vi.pedestals_mean_x12() = pedestalsEE[i].mean_x12;
         vi.pedestals_rms_x12() = pedestalsEE[i].rms_x12;

instead of the memset() also makes the valgrind messages go away ?

@thomreis should EcalMultifitConditionsHostESProducer fill the rawid ?

from cmssw.

fwyzard avatar fwyzard commented on August 23, 2024

Mhm, grep does not find any mention of rawid in RecoLocalCalo/EcalRecProducers/ - so it doesn't seem to be used anywhere.

Then, I don't understand what un-initialised memory valgrind is complaining about :-(

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

no it is not enough...
Is not that there are some "padding space" for alignment and somehow it is accessed?

from cmssw.

fwyzard avatar fwyzard commented on August 23, 2024

Yes, there is padding between the columns... but it should not be accessed, unless the indices used to access the SoA are larger than its nominal capacity.

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

in RecoLocalCalo/EcalRecProducers/plugins/alpaka/TimeComputationKernels.h
pedestal can be accessed with the channel index

   849            pedestal = conditionsDev.pedestals_mean_x12()[ch];

while everywhere else is accessed with the HashedIndex
@thomreis : is this intended or is a bug?

869 auto const mean_x6 = conditionsDev.pedestals_mean_x6()[hashedId];
870 auto const rms_x6 = conditionsDev.pedestals_rms_x6()[hashedId];
871 auto const gain12Over6 = conditionsDev.gain12Over6()[hashedId];
872 sample_value = (static_cast(adc) - mean_x6) * gain12Over6;
873 sample_value_error = rms_x6 * gain12Over6;

etc

in addition the hasindex is computed from the digi rawid
833 auto const hashedId = isBarrel ? ecal::reconstruction::hashedIndexEB(did.rawId())
834 : offsetForHashes + ecal::reconstruction::hashedIndexEE(did.rawId());

and if did.rawId() is junk (as apparently can be, as seen in the companion issue) hashedId can easily be anything....

from cmssw.

fwyzard avatar fwyzard commented on August 23, 2024

Very good points.

We have the possibility of enabling runtime checks on the indices, let me dig how to do it.
I actually think we should actually enable it by default, if the cost is not too high.

from cmssw.

fwyzard avatar fwyzard commented on August 23, 2024

Can you try running with #44987 ?

from cmssw.

VinInn avatar VinInn commented on August 23, 2024

what is supposed to happen?
anyhow no message

from cmssw.

fwyzard avatar fwyzard commented on August 23, 2024

It's supposed to fail an assert or throw an exception if there is an out of bounds access (and if I made the correct changes).

from cmssw.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.