
flashgg's People

Contributors

alesauva, andreh7, arnabpurohit, bmarzocc, camilocarrillo, cippy, edjtscott, emanueledimarco, ferriff, gkrintir, innakucher, junquantao, kmondal, malcles, martinamalberti, matteosan1, maxgalli, mdonega, michelif, musella, olivierbondu, panwarlsweet, saghosh, sam-may, sethzenz, simonepigazzini, threiten, vtavolar, yhaddad, youyingli


flashgg's Issues

Clean up configurations and analysis scripts

Except maybe for a few things like simple_Producer_test.py and simple_Tag_test.py, we should move configurations and analysis scripts to a common place outside the producer directories. (Perhaps a folder under Commissioning?) I think the divide should be as follows:

  • A well-defined validation procedure uses a few configurations/scripts in the producer test directories. The validation procedure and its expected output are documented and used to make sure that new code doesn't break anything.
  • Everything else is moved.

unused beamspot

MicroAODProducers/test/simple_Producer_test.py

This is a leftover, not used anymore:
L28 BeamSpotTag=cms.untracked.InputTag('offlineBeamSpot'),

Add examples of workspace/ntuple dumping to simple_Tag_test.py

Currently, when evaluating PRs that I'm not directly familiar with, I rely heavily on running the standard MicroAOD followed by Taggers/test/simple_Tag_test.py. It's clear that a lot of the dumpers etc. aren't being tested, and this is often exactly the functionality that new PRs change. I would welcome either an update to the tag test or suggestions for additional standard jobs that exercise everything that might break.

Tags: interleaved sorting

Allow, for example, VBF 0 > Untagged 0 > VBF 1 > ... Configure the TagSorter with a VPSet listing (tagName, minCat, maxCat) tuples, as sketched below.
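
A minimal sketch of what such a configuration could look like; the producer and parameter names (TagPriorityRanges, MinCategory, MaxCategory) are hypothetical placeholders rather than a final interface:

import FWCore.ParameterSet.Config as cms

# Each PSet is one priority slot; the sorter walks the list in order, so
# VBF category 0 outranks Untagged category 0, which outranks VBF category 1.
flashggTagSorter = cms.EDProducer(
    'FlashggTagSorter',
    TagPriorityRanges = cms.VPSet(
        cms.PSet( TagName = cms.InputTag('flashggVBFTag'),   MinCategory = cms.int32(0), MaxCategory = cms.int32(0) ),
        cms.PSet( TagName = cms.InputTag('flashggUntagged'), MinCategory = cms.int32(0), MaxCategory = cms.int32(0) ),
        cms.PSet( TagName = cms.InputTag('flashggVBFTag'),   MinCategory = cms.int32(1), MaxCategory = cms.int32(1) ),
    )
)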

Change object format for DzVertexMap object

TL;DR: we should get rid of the maps we put in the event. I will take care of this, but let me know if you need a map and I can help you replace it. More details below if you're curious.

I have been advised by CMSSW memory-management experts that maps in the event can create issues or be unstable, and that map<T,vector<U> > is especially bad. [Here T=Ptr<vertex> and U=PackedCandidate.] For a one-to-many map, this construct can be replaced with some clever handling of vector<pair<T,U> >, in which all the U's corresponding to a given T are stored contiguously, and you rely on that ordering to look at only the relevant part of the vector (see the sketch below).

Also, for reference, a one-to-one map<T,U> should simply be replaced with a vector<U>, relying on the fact that the new collection is in the same order as the old one. I had significant trouble creating the map anyway, because CMSSW implicitly requires a vector<T> dictionary for a map<T,U> even if you don't think you're using vector<T>. I eventually got rid of the map (in code I will open a pull request for later this morning) and just used the vector<U>.
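
To illustrate the one-to-many lookup pattern, here is a sketch (in Python for brevity; the real replacement would of course be a C++ vector<pair<T,U> > in the data format, and all names here are invented for the example):

# All values for a given key are stored contiguously, so a lookup can scan
# forward and stop as soon as it leaves the matching block.
pairs = [('vtx0', 'cand0'), ('vtx0', 'cand1'), ('vtx1', 'cand2')]

def values_for(key, pairs):
    out, in_block = [], False
    for k, v in pairs:
        if k == key:
            out.append(v)
            in_block = True
        elif in_block:
            break  # we have passed the contiguous block for this key
    return out

print(values_for('vtx0', pairs))  # ['cand0', 'cand1']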

Modify README to clone upstream, not personal repo

As the instructions stand, users clone their own fork of the repository and then add cms-analysis/flashgg as upstream.

This is error prone because in general the user's master branch will not be in sync with the flashgg one.

The instructions should be modified so that users clone the flashgg master directly; they can then set their personal fork as a second remote, as sketched below.
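
A minimal sketch of the proposed workflow (the remote name "myfork" and the username placeholder are illustrative):

git clone git@github.com:cms-analysis/flashgg.git
cd flashgg
git remote add myfork git@github.com:<your-username>/flashgg.git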

I can make the necessary changes if people agree.

Diphoton candidates are stored twice

Hi,

For now, the diphoton candidates are stored twice in the final diphoton collection:
https://github.com/cms-analysis/flashgg/blob/master/MicroAODProducers/plugins/DiPhotonProducer.cc#L87
and
https://github.com/cms-analysis/flashgg/blob/master/MicroAODProducers/plugins/DiPhotonProducer.cc#L95

We are currently working with @swagata87 on implementing the photon four-momentum kinematics changes due to the vertex assignment around these lines. We will probably correct this when we are ready to open a pull request; this issue is to make sure people are aware of it in the meantime.

cleanup ?

In MicroAODProducers/test/simple_Producer_test.py I would move

process.flashggVertexMapUnique = cms.EDProducer('FlashggDzVertexMapProducer',
…
process.flashggVertexMapNonUnique = cms.EDProducer('FlashggDzVertexMapProducer',
…
process.flashggJets = cms.EDProducer('FlashggJetProducer',

into python fragments under /python.

MicroAOD size tuning

Current content (as of phys14 v2 production) is ~50% of MiniAOD.

We need to achieve a factor 2.5 reduction in order to meet the 1/5 size goal.

I made a small exercise to see what could be done to meet the target.
https://musella.web.cern.ch/musella/higgs/flashgg/miniAOD.xls

Summary is that we need to:

  • avoid photon info duplication in the di-photon collection
  • preselect leptons
  • drop cluster information and store a subset in the photons
  • only store photon ID info for vertexes associated with at least one di-photon pair
  • reduce the size of the jet collection

prepareCrabJobs.py

flashgg/MetaData/work/prepareCrabJobs.py

108 if options.dumpCfg:
109 print ( dumpCfg(cfg) )
110 exit(0)

it should be:
109 print ( dumpCfg(options) )

Keep all vertexes info in VertexSelector

In order to be able to perform a training of the vertex selection algorithm, the information about all vertexes should be stored in the DiPhoton candidates.

For space reasons, one could limit the total number of vertexes to be stored, but the functionality of the algorithms and the data format needs to be extended.

"Reroute" diphoton processing in tag step

The default tag sequence currently uses the DiPhotonCollection directly as input. I propose to switch it to start with the preselectedDiPhotonCollection and use the central value of the new systematics producer (see the sketch below). This should not have any drastic downstream effects, and has to be done anyway, but I will check that the tag output is more or less the same. Any other comments/concerns?
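
A minimal sketch of the proposed rerouting, with a hypothetical tag-producer label standing in for the real modules in the Taggers package:

import FWCore.ParameterSet.Config as cms

# Hypothetical example module; the point is only the DiPhotonTag change.
flashggExampleTag = cms.EDProducer(
    'FlashggExampleTagProducer',
    # was: DiPhotonTag = cms.untracked.InputTag('flashggDiPhotons')
    DiPhotonTag = cms.untracked.InputTag('flashggPreselectedDiPhotons'),
)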

./prepareCrabJobs.py --load <previous_config.json>

The "--load" features doesn't work because of two bugs in optpars_utils.py

The first one is:
L32 if origin:
L33 origin += ",%s",value

should be

L32 if origin:
L33 origin += " "
L34 origin += ''.join(value)

The second is:

L50 if attr and type(attr) == list:
L51 attr.extend(v)
L52 setter(dest,k,v)

which should be

L50 if attr and type(attr) == list:
L51 attr.extend(v)
L52 setter(dest,k,attr)
L53 else:
L54 setter(dest,k,v)

Switch to PFCHSLeg jets

With the improved track-vertex association for PFCHS collection building (see the talk linked from https://twiki.cern.ch/twiki/bin/viewauth/CMS/FLASHggFramework#2015_03_02 ), the PFCHSLeg jet collection is OK. I propose to make it the default for the tags that use jets -- not as a final choice, but it is sensible in a way that PFCHS0 is not. We should watch carefully that this does not create memory problems on the grid, but I have run successfully with the default 2 GB. Any comments/concerns about making this change in the default sequence?

Migrate to 73X

It appears that CMSSW_7_3_2 is a sufficiently stable migration target. However, some of the effort (e.g. on jet tools) is non-trivial. I propose we do this after the Higgs Workshop to avoid confusion.

oldval / newval print statements from GBRLikelihood

We get these statements from the diphoton MVA code:

oldval = 0.010000, newval = -1.289817, evaluate = 0.010000
oldval = 1.000000, newval = -0.111341, evaluate = 1.000000
oldval = 2.000000, newval = -1.542650, evaluate = 2.000000
oldval = 2.000000, newval = -1.542650, evaluate = 2.000000

They come from here:

https://github.com/bendavid/GBRLikelihood/blob/4fda233acf853c38ce657313fd7259957bd874b7/src/RooHybridBDTAutoPdf.cc#L281

We get that version by using the following tag:

git clone -b hggpaperV8 https://github.com/bendavid/GBRLikelihood

TODO: fork and make a flashgg branch (as we already do for GBRLikelihoodEGTools) to get rid of these messages, then update the instructions.

Reduce size of PileupSummaryInfos

Currently PileupSummaryInfos_addPileupInfo takes up the second-largest amount of space in the expanded MicroAOD:

File file:myMicroAODOutputFile.root Events 259
Branch Name                               | Average Uncompressed Size (Bytes/Event) | Average Compressed Size (Bytes/Event)
flashggJets_flashggJets__FLASHggMicroAOD. |                                 76707.5 |                               9590.08
PileupSummaryInfos_addPileupInfo__HLT.    |                                 14749.1 |                               5354.72

TODO: filter or otherwise reduce the size of this collection

Conversions in VertexSelector

LegacyVertexSelector.cc#L481

float nConv = conversionsVector.size();

The MVA should receive as input how many of the two photons are converted (i.e. 0, 1 or 2).
Instead, I guess it is currently looking at the number of conversions in the whole event, which it gets from the DiPhotonProducer:

Handle<View<reco::Conversion> > conversions;
evt.getByToken(conversionToken_,conversions);
const PtrVector<reco::Conversion>& conversionPointers = conversions->ptrVector();

A possible patch for the code, at L477, could be:

float nConv = 0;
if (IndexMatchedConversionLeadPhoton != -1) ++nConv;
if (IndexMatchedConversionTrailPhoton != -1) ++nConv;

float pull_conv = -999.;
if (nConv != 0) {
    double zconv  = getZFromConvPair(g1,g2,IndexMatchedConversionLeadPhoton,IndexMatchedConversionTrailPhoton,conversionsVector,beamSpot);
    double szconv = getsZFromConvPair(g1,g2,IndexMatchedConversionLeadPhoton,IndexMatchedConversionTrailPhoton,conversionsVector);

    if (szconv != 0) pull_conv = fabs(vtx->position().z()-zconv)/szconv;
    else pull_conv = 10.;

    if (pull_conv > 10.) pull_conv = 10.;
}

logsumpt2_ = log(sumpt2_in+sumpt2_out);
ptbal_ = ptbal;
pull_conv_ = pull_conv;
nConv_ = nConv;

I cross-checked with Pasquale that this line is indeed correct:
if (pull_conv > 10.) pull_conv = 10.;

Can't read in flashggPreselectedDiPhotons

For packages downstream from diphoton preselection, this works:

DiPhotonTag=cms.untracked.InputTag('flashggDiPhotons')

But this doesn't:

DiPhotonTag=cms.untracked.InputTag('flashggPreselectedDiPhotons')

Error message below. To-do: figure out why not.

----- Begin Fatal Exception 06-Oct-2014 16:50:58 CEST-----------------------
An exception of category 'ProductNotFound' occurred while
[0] Processing run: 1 lumi: 39 event: 3801
[1] Running path 'p'
[2] Calling event method for module FlashggDiPhotonMVAProducer/'flashggDiPhotonMVA'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for a container with elements of type: flashgg::DiPhotonCandidate
Looking for module label: flashggPreselectedDiPhotons
Looking for productInstanceName:

Tags: sorted collections

Set up the Tag producers so that tags are automatically ordered with operator< within each tag producer. Use ordered CMSSW collections if possible; check how it's done for PatCandidates. (Check whether this feature simplifies the tag sorter logic.)

Tab length

This annoys me regularly, so I am opening an issue, but feel free to close it if I am the only one in this case... One example of how this affects code readability would be [1].

Is there a way to agree on some common setting ? Like tabs implemented as 4 spaces ?

There used to be some settings like that in globe [2], but they are file-by-file; would anyone know a way to define them for the full repository ?

Cheers,
Olivier

[1] InnaKucher@e5b6f70#diff-0
[2] https://github.com/h2gglobe/h2gglobe/blob/master/PhotonAnalysis/src/PhotonAnalysis.cc#L6902-L6908

Add electron ID and MVA cuts to tags using electrons

Since PR #152 there are no longer any cuts applied to electrons in the ElectronProducer. (I kept the code but added an ApplyCuts flag that is set to false by default.) These cuts should instead be applied in the Tag producers that use electrons.

BeamSpotHandle not valid

In this file:

root://eoscms//eos/cms/store/cmst3/user/gpetrucc/miniAOD/v1/GluGluToHToGG_M-125_13TeV-powheg-pythia6_Flat20to50_PAT.root

which appears to contain the beamspot:

reco::BeamSpot "offlineBeamSpot" "" "RECO"

the beam-spot handle's isValid() method returns false (and if you try to read anything you get a ProductNotFound exception). To-do: figure out why.

method in conversions

MicroAODAlgos/plugins/ZerothVertexSelector.cc

The variable

int method=0;

appears in several places, each time initialised to the default value of zero. It should become a single shared setting, to avoid the different functions going out of sync.

Additional gen information

Additional gen-level information should be kept. In particular:

  • all hard-process particles should be added to flashggPrunedGenParticles (see the sketch below).
  • MC matching should be added to the photon producer.
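
For the first item, a minimal sketch using the standard GenParticlePruner; the keep strings are illustrative, since the relevant status codes depend on the generator:

import FWCore.ParameterSet.Config as cms

flashggPrunedGenParticles = cms.EDProducer(
    'GenParticlePruner',
    src = cms.InputTag('prunedGenParticles'),
    select = cms.vstring(
        'drop *',
        'keep status == 3',                 # hard-process particles (Pythia6 convention)
        'keep status > 20 && status < 30',  # hard-process particles (Pythia8 convention)
    )
)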

Metadata and AAA

The MetaData scripts always look for files on EOS unless useAAA=1 is set. This is OK for FWLite, but in CMSSW (full framework) the default behaviour when a /store/... path is encountered should be to use the site configuration.

Make eventContent cff for microAOD

I guess that it should go to

MicroAODProducers/python/flashggMicroAODOutputCommands_cff.py

or something similar.

Current output in test scripts is:

outputCommands = cms.untracked.vstring("drop *",
"keep *_flashgg*_*_*",
"drop *_flashggVertexMap*_*_*",
"keep *_offlineSlimmedPrimaryVertices_*_*",
"keep *_reducedEgamma_reduced*Clusters_*",
"keep *_reducedEgamma_*PhotonCores_*",
"keep *_slimmedElectrons_*_*",
"keep *_slimmedMuons_*_*",
"keep *_slimmedMETs_*_*",
"keep *_slimmedTaus_*_*",
"keep *_fixedGridRhoAll_*_*"
)

At the very least one needs to add the beamspot and gen info; a sketch is below.
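
A minimal sketch of such a cff, reusing the current commands and adding the beamspot and gen collections (the gen label assumes the flashggPrunedGenParticles collection discussed in the gen-information issue above):

import FWCore.ParameterSet.Config as cms

microAODDefaultOutputCommands = cms.untracked.vstring(
    "drop *",
    "keep *_flashgg*_*_*",
    "drop *_flashggVertexMap*_*_*",
    "keep *_offlineSlimmedPrimaryVertices_*_*",
    "keep *_reducedEgamma_reduced*Clusters_*",
    "keep *_reducedEgamma_*PhotonCores_*",
    "keep *_slimmedElectrons_*_*",
    "keep *_slimmedMuons_*_*",
    "keep *_slimmedMETs_*_*",
    "keep *_slimmedTaus_*_*",
    "keep *_fixedGridRhoAll_*_*",
    # additions requested in this issue:
    "keep *_offlineBeamSpot_*_*",
    "keep *_flashggPrunedGenParticles_*_*",
)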

EB cut

MicroAODAlgos/plugins/ZerothVertexSelector.cc

Everywhere (e.g. L205, L246, L278):

pho->eta() < 1.5 ---> fabs(pho->eta()) < 1.5

Use standard PileupJetId recipes/recommendations

Eventually there should be a method for JME-recommended pileup jet id selection, and we should use that instead of a raw MVA value cut. However, currently all we have is a float MVA output, from an MVA training based on antiKt5 jets in Run1. In MiniAOD, in fact, the float output for this is all that is saved for the jets. It's run before the jet is slimmed for miniAOD, and then saved in the miniAOD jets as a userFloat: see https://twiki.cern.ch/twiki/bin/view/CMSPublic/WorkBookMiniAOD#Jets for more on this.

For MicroAOD, we have written a new method to use the DzVertexMap and rerun the MVA w.r.t. a non-standard vertex. In fact, this method is ahead of JME, which as of the last time I checked had not provided any code yet for computing Pu Jet ID on miniAOD. I have been in communication with the JME person responsible for this update and provided him with our code in case it's useful.

As they become available, we should adopt the most standard tools that do the job we need, and we should provide feedback to JME if there is an additional feature they should add that would let us use a more standardized recipe.

MetaData scripts completion

Opening this issue to make sure we don't forget this.

As they are, MetaData scripts need to be completed to:

  • Have the ability to resubmit failed jobs.
  • Be more accurate at ensuring the reproducibility of task running.
    For the most important jobs we should store a tgz with the config and libraries and
    set up a scram scratch space at run time (à la crab).
  • Compute the per-sample PU distribution and generate PU weights.
  • Keep an eye on the duty cycle and automatically resubmit stuck jobs to cope
    with eos shortcomings.
  • Define a generic batch interface to support non-LSF clusters.
  • Use the fwk job report to ensure that the full dataset was processed.

Embed rechits in flashgg::Photon

Some recHits should be embedded in the flashgg::Photons.

The seed-crystal recHit is the bare minimum; we may also consider keeping the full 5x5.

One potential issue with the 5x5 may be the data duplication between the photon and di-photon collections. If this becomes an issue, we could zero all the recHits except the seed just before copying to the di-photon object.

This is in fact a more general issue, related to the duplication of several variables which we inherit directly from the pat and reco photons.

Some flashgg::Photon methods/data members shadow pat::Photon ones

In particular those related to regression inputs:

https://github.com/cms-analysis/flashgg/blob/master/DataFormats/interface/Photon.h#L25

http://cmslxr.fnal.gov/lxr/source/DataFormats/PatCandidates/interface/Photon.h#0234

Also, even in cases where the methods are not explicitly shadowed, we end up generating confusion about which method is being called.

I would suggest adding an "fgg" prefix (or similar) to both the data member names and the getters/setters.
