Comments (15)

makortel commented on August 23, 2024

assign core,pdmv

from cmssw.

cmsbuild commented on August 23, 2024

cms-bot internal usage

cmsbuild commented on August 23, 2024

A new Issue was created by @ArturAkh.

@smuzaffar, @antoniovilela, @sextonkennedy, @rappoccio, @makortel, @Dr15Jones can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

cmsbuild commented on August 23, 2024

New categories assigned: core,pdmv

@Dr15Jones,@AdrianoDee,@sunilUIET,@miquork,@makortel,@smuzaffar you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented on August 23, 2024

So far, we at KIT observe what seems to be a difference in memory usage when running an el8 container on an el7 host (lower usage) compared to an el8/el9 host (higher usage). However, we don't have a concrete explanation for this observation yet. Investigation at KIT is still ongoing.

We have another report too, where the same job on an EL9 host uses significantly more RSS (of the order of 20 %) than on an EL8 or EL7 host: #42929

makortel commented on August 23, 2024
root://cmsdcache-kit-disk.gridka.de//store/unmerged/logs/prod/2024/5/23/pdmvserv_Run2024C_JetMET1_ECALRATIO_240521_151559_1672/DataProcessing/0003/1/d4985bf1-e3c4-4eb9-86ba-32470fb02eed-229-1-logArchive.tar.gz

Looking at the log, the job was run on CMSSW_14_0_7. The first MemoryCheck printout was

%MSG-w MemoryCheck:  source 23-May-2024 15:45:59 CEST PreSource
MemoryCheck: module source:source VSIZE 9637.79 0 RSS 6785.51 3.91797
%MSG
Begin processing the 5th record. Run 380005, Event 903899790, LumiSection 535 on stream 1 at 23-May-2024 15:45:59.648 CEST

6.6 GB sounds like a lot for a 4-thread job, compared to the expectation of 2 GB/thread. From the printouts I see DQM modules being included in the job.
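The comparison against the per-thread expectation can be sketched in Python. This is a minimal sketch, assuming the MemoryCheck line has the layout `VSIZE <value> <delta> RSS <value> <delta>` with values in MB, as the printout above suggests. The names `parse_memcheck` and `budget_fraction`, and the choice of 2000 MB for "2 GB/thread", are illustrative assumptions and not part of CMSSW.

```python
import re

# Assumed layout: "VSIZE <value> <delta> RSS <value> <delta>", values in MB.
MEMCHECK_RE = re.compile(
    r"MemoryCheck: module (?P<module>\S+) "
    r"VSIZE (?P<vsize>[\d.]+) (?P<dvsize>-?[\d.]+) "
    r"RSS (?P<rss>[\d.]+) (?P<drss>-?[\d.]+)"
)


def parse_memcheck(line):
    """Return (module, vsize_mb, rss_mb), or None if the line does not match."""
    m = MEMCHECK_RE.search(line)
    if m is None:
        return None
    return m.group("module"), float(m.group("vsize")), float(m.group("rss"))


def budget_fraction(rss_mb, n_threads, mb_per_thread=2000.0):
    """Fraction of the job budget (n_threads * mb_per_thread) taken by RSS."""
    return rss_mb / (n_threads * mb_per_thread)


# The MemoryCheck line quoted above.
line = "MemoryCheck: module source:source VSIZE 9637.79 0 RSS 6785.51 3.91797"
module, vsize_mb, rss_mb = parse_memcheck(line)
fraction = budget_fraction(rss_mb, n_threads=4)
print(f"{module}: RSS {rss_mb:.0f} MB, {fraction:.0%} of the 4-thread budget")
```

Under these assumptions the job already sits at roughly 85 % of the 4-thread budget at the 5th record, before the later growth.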

ArturAkh commented on August 23, 2024

Looking at the log, the job was run on CMSSW_14_0_7. The first MemoryCheck printout was

%MSG-w MemoryCheck:  source 23-May-2024 15:45:59 CEST PreSource
MemoryCheck: module source:source VSIZE 9637.79 0 RSS 6785.51 3.91797
%MSG
Begin processing the 5th record. Run 380005, Event 903899790, LumiSection 535 on stream 1 at 23-May-2024 15:45:59.648 CEST

6.6 GB sounds like a lot for a 4-thread job, compared to the expectation of 2 GB/thread. From the printouts I see DQM modules being included in the job.

Yes, indeed. At the end, the memory report states that the peak RSS size is well above 8GB:

MemoryReport> Peak rss size 8317.52 Mbytes
 Key events increasing rss:
[900] run: 380005 lumi: 535 event: 902919827  vsize = 14651.1 deltaVsize = 0 rss = 8294.34 delta = 100.824
[904] run: 380005 lumi: 535 event: 903727220  vsize = 14638.9 deltaVsize = -12.3555 rss = 8316.61 delta = -0.90625
[903] run: 380005 lumi: 535 event: 903241610  vsize = 14651.1 deltaVsize = -0.105469 rss = 8317.52 delta = 23.1836
[9] run: 380005 lumi: 535 event: 904518024  vsize = 11789.4 deltaVsize = 674.512 rss = 8001.11 delta = 119.016
[12] run: 380005 lumi: 535 event: 903713728  vsize = 11795.4 deltaVsize = 1.89453 rss = 8171.79 delta = 160.375
[5] run: 380005 lumi: 535 event: 903383179  vsize = 11046.9 deltaVsize = 1402.75 rss = 7808.14 delta = 1024.64
MessageLogger: dropped waiting message count 408
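For anyone wanting to pull these numbers out of many logs, here is a minimal Python sketch, assuming the report keeps exactly the text layout shown above. The function name `parse_memory_report` and the dictionary keys are illustrative, and reading the bracketed number as a record index is an assumption.

```python
import re

PEAK_RE = re.compile(r"MemoryReport> Peak rss size ([\d.]+) Mbytes")
EVENT_RE = re.compile(
    r"\[(\d+)\] run: (\d+) lumi: (\d+) event: (\d+)\s+"
    r"vsize = ([\d.]+) deltaVsize = (-?[\d.]+) "
    r"rss = ([\d.]+) delta = (-?[\d.]+)"
)


def parse_memory_report(text):
    """Return (peak_rss_mb, events) parsed from the log text."""
    peak = None
    events = []
    for line in text.splitlines():
        m = PEAK_RE.search(line)
        if m:
            peak = float(m.group(1))
            continue
        m = EVENT_RE.search(line)
        if m:
            events.append({
                "index": int(m.group(1)),  # assumed to be the record index
                "event": int(m.group(4)),
                "rss": float(m.group(7)),
                "delta_rss": float(m.group(8)),
            })
    return peak, events


# Three of the lines quoted above.
sample = """MemoryReport> Peak rss size 8317.52 Mbytes
[900] run: 380005 lumi: 535 event: 902919827  vsize = 14651.1 deltaVsize = 0 rss = 8294.34 delta = 100.824
[903] run: 380005 lumi: 535 event: 903241610  vsize = 14651.1 deltaVsize = -0.105469 rss = 8317.52 delta = 23.1836
[5] run: 380005 lumi: 535 event: 903383179  vsize = 11046.9 deltaVsize = 1402.75 rss = 7808.14 delta = 1024.64
"""
peak, events = parse_memory_report(sample)
largest_jump = max(events, key=lambda e: e["delta_rss"])
print(peak, largest_jump["index"], largest_jump["delta_rss"])
```

In this excerpt the largest single RSS jump (about 1 GB) is the entry labelled [5], while the peak itself is reached around entry [903].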

makortel commented on August 23, 2024

If I dug around correctly, the cmsDriver is here
https://cmsweb.cern.ch/couchdb/reqmgr_config_cache/476e400f0676a4c134fdd8b6b87845df/configFile
It contains --eventcontent ALCARECO,AOD,MINIAOD,DQM --step RAW2DIGI,L1Reco,RECO,PAT,,SKIM:@JetMET1,DQM:@rerecoCommon+@hcal2 --procModifiers gpuValidationEcal

So DQM is clearly involved. I'm wondering about the purpose of --procModifiers gpuValidationEcal, because otherwise there doesn't seem to be anything GPU-related in the job.

The job contains a total of about 2500 modules, of which 11 are OutputModules. OutputModules need memory for buffering, so each OutputModule increases the memory need.

makortel commented on August 23, 2024

I ran some events of the example through IgProf memory profiler (on EL7) on one thread/stream. Here is the heap state after 5 events
https://mkortela.web.cern.ch/mkortela/cgi-bin/navigator/issue45028/test_07.5_live

Some observations

makortel commented on August 23, 2024

One conclusion is that this kind of job would have a higher chance of fitting within the memory limit if run with 8 threads.
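One way to picture this conclusion is a simple model in which the job's RSS is a fixed part shared by all threads plus a per-stream part, while the limit grows as 2 GB per thread. The Python sketch below rests entirely on that assumed model. The 5000 MB shared and 800 MB per-stream figures are hypothetical placeholders, not measurements from this job, and `fits_limit` is an illustrative name.

```python
def fits_limit(shared_mb, per_stream_mb, n_threads, limit_mb_per_thread=2000.0):
    """Estimate RSS as shared + n_threads * per_stream and compare to the limit.

    Returns (estimated_mb, limit_mb, fits).
    """
    estimated = shared_mb + n_threads * per_stream_mb
    limit = n_threads * limit_mb_per_thread
    return estimated, limit, estimated <= limit


# Hypothetical split between shared and per-stream memory (not measured).
for n in (4, 8):
    est, lim, ok = fits_limit(shared_mb=5000.0, per_stream_mb=800.0, n_threads=n)
    print(f"{n} threads: estimated {est:.0f} MB, limit {lim:.0f} MB, fits: {ok}")
```

With these placeholder numbers the 4-thread job exceeds its limit while the 8-thread job does not, because the shared part is spread over more threads.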

ArturAkh commented on August 23, 2024

After extracting info from logs:

https://cernbox.cern.ch/s/p9awVmSt7nySHPV

From this, the following plot of peak RSS was created:

[Figure: rss_histogram, histogram of peak RSS]

The jobs are grouped by host OS:

  • el7 is mostly KIT Tier 1 with a very few FNAL slots
  • el8 is KIT-HoreKa subsite
  • el9 is FNAL
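A grouping like this can be sketched in Python as follows. This is only an illustration of the bookkeeping: the `records` values are placeholders invented for the example, not numbers taken from the logs, and `summarize_by_host_os` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_by_host_os(records):
    """Group (host_os, peak_rss_mb) pairs and summarize each group."""
    groups = defaultdict(list)
    for host_os, peak_rss_mb in records:
        groups[host_os].append(peak_rss_mb)
    return {
        host_os: {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
            "max": max(values),
        }
        for host_os, values in sorted(groups.items())
    }


# Placeholder values for illustration only; not taken from the logs.
records = [
    ("el7", 7000.0), ("el7", 7200.0),
    ("el8", 8000.0),
    ("el9", 8100.0), ("el9", 8300.0),
]
summary = summarize_by_host_os(records)
for host_os, stats in summary.items():
    print(host_os, stats)
```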

RHofsaess commented on August 23, 2024

The corresponding logs are archived here: xrdfs root://cmsdcache-kit-disk.gridka.de/ ls /store/user/rhofsaess/LOGS/KIT_FNAL_Run2024C_logs.tar.gz
This includes all logs, failed and successful, for the campaigns: pdmvserv_Run2024C_EGamma0..., pdmvserv_Run2024C_JetMET0..., and pdmvserv_Run2024C_JetMET1... from KIT (T1 and KIT-HoreKa subsite) and FNAL for 21.+22.05.24.

Note: Those are many files and untar'ing may take a while 😄

$ xrdfs root://cmsdcache-kit-disk.gridka.de/ stat /store/user/rhofsaess/LOGS/KIT_FNAL_Run2024C_logs.tar.gz

Path:   /store/user/rhofsaess/LOGS/KIT_FNAL_Run2024C_logs.tar.gz
Id:     0
Size:   50219471028
MTime:  2024-05-24 14:02:26
Flags:  48 (IsReadable|IsWritable)
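Since the archive is about 50 GB, it may be more practical to scan it as a stream than to unpack it. The Python sketch below assumes the log text files sit directly inside the tarball; if the archive instead holds nested logArchive tarballs, another layer of unpacking would be needed. The function name `peak_rss_from_tar` and the demo file name are hypothetical.

```python
import io
import re
import tarfile

PEAK_RE = re.compile(rb"MemoryReport> Peak rss size ([\d.]+) Mbytes")


def peak_rss_from_tar(fileobj):
    """Yield (member_name, peak_rss_mb) for each file with a peak-RSS line."""
    # Mode "r|gz" reads the archive as a forward-only stream,
    # so nothing is unpacked to disk.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            m = PEAK_RE.search(data)
            if m:
                yield member.name, float(m.group(1))


def make_demo_archive():
    """Build a tiny in-memory tar.gz with one made-up log file."""
    buf = io.BytesIO()
    payload = b"MemoryReport> Peak rss size 8317.52 Mbytes\n"
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo(name="job1/cmsRun.log")  # hypothetical name
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


results = list(peak_rss_from_tar(make_demo_archive()))
print(results)
```

For the real archive, a file opened with `open(path, "rb")` on a local copy could be passed in place of the demo buffer.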

makortel commented on August 23, 2024

I ran the job with the first 80 events on 4 threads on lxplus[89], directly and via el[98] containers, and with jemalloc, TCMalloc, and GlibC malloc. I extracted the peak RSS and VSIZE. The numbers are from a single execution, so any statistical fluctuation is ignored.

Peak RSS (MB)

                        el8 host    el9 host
Jemalloc, el8 binary    6973.75     6664.13
Jemalloc, el9 binary    6783.95     7070.38
TCMalloc, el8 binary    7280.86     7260.79
TCMalloc, el9 binary    7198.72     7366
GlibC, el8 binary       7585.84     7654.41
GlibC, el9 binary       7604.34     7765.85

Peak VSIZE (MB)

                        el8 host    el9 host
Jemalloc, el8 binary    12790.8     12504
Jemalloc, el9 binary    12719.1     12475.7
TCMalloc, el8 binary    10084.9     9898.71
TCMalloc, el9 binary    9846.76     9991.8
GlibC, el8 binary       10745.3     10695.6
GlibC, el9 binary       10735.1     10806.2

This test did not show a sizable difference between el8 and el9 hosts. Maybe the mechanism that leads to higher memory usage on el9 hosts is more dynamic in nature than what my simple test exercised?
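To put a number on "no sizable difference", the relative change in peak RSS from the el8 host to the el9 host can be computed from the RSS values in this comment. A minimal Python sketch (the name `host_difference_percent` is illustrative):

```python
# Peak RSS values from this comment:
# (allocator, binary) -> (el8 host, el9 host)
PEAK_RSS = {
    ("Jemalloc", "el8"): (6973.75, 6664.13),
    ("Jemalloc", "el9"): (6783.95, 7070.38),
    ("TCMalloc", "el8"): (7280.86, 7260.79),
    ("TCMalloc", "el9"): (7198.72, 7366.0),
    ("GlibC", "el8"): (7585.84, 7654.41),
    ("GlibC", "el9"): (7604.34, 7765.85),
}


def host_difference_percent(table):
    """Relative change (in percent) going from the el8 host to the el9 host."""
    return {
        key: 100.0 * (el9_host - el8_host) / el8_host
        for key, (el8_host, el9_host) in table.items()
    }


diffs = host_difference_percent(PEAK_RSS)
for (allocator, binary), pct in diffs.items():
    print(f"{allocator:9s} {binary} binary: {pct:+.1f} %")
```

All six differences stay within about 5 % in either direction, well below the order of 20 % mentioned for the earlier report, bearing in mind that each number comes from a single run.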

ArturAkh commented on August 23, 2024

Hi @makortel,

I think it would be worth testing the full processing, not only the first few events. It might be that some particular events are causing this rise in peak RSS.

Furthermore, as far as I understood, you have tested an el8/el9 container on el8/el9 hosts, right?

At least in the context of the issue discussed here, the problem is rather that the el7 host + el8/el9 container combination consumes less memory than what you have tested. So I'd say your results are as expected.

Cheers,

Artur

makortel commented on August 23, 2024

I think it would be worth testing the full processing, not only the first few events. It might be that some particular events are causing this rise in peak RSS.

Perhaps, but that would take days to test. I had time only for something relatively quick, and wanted to see if there was a reproducible systematic difference between el8 and el9.

Furthermore, as far as I understood, you have tested an el8/el9 container on el8/el9 hosts, right?

Correct.

At least in the context of the issue discussed here, the problem is rather that the el7 host + el8/el9 container combination consumes less memory than what you have tested.

Ah, I guess I missed the point that the difference was between el7 and el8/el9 hosts (possibly because in the earlier case #42929 it was el9 specifically that made the difference).
