Coder Social home page Coder Social logo

hmumuml's Introduction

Welcome to HmumuML

HmumuML is a package that can be used to study the XGBoost BDT categorization in Hmumu analysis.

Setup

This project requires to use python3 and either conda or virtual environment. If you want to use a conda environment, install Anaconda or Miniconda before setting up.

First time setup on lxplus

First checkout the code:

git clone ssh://[email protected]:7999/HZZ/HZZSoftware/Hmumu/hmumuml.git [-b your_branch]
cd HmumuML

Then,

source scripts/install.sh

Normal setup on lxplus

After setting up the environment for the first time, you can return to this setup by doing source scripts/setup.sh

Scripts to run the tasks

Prepare the training inputs (skimmed ntuples)

The core script is skim_ntuples.py. The script will apply the given skimming cuts to input samples and produce corresponding skimmed ntuples. You can run it locally for any specified input files. In H->mumu, it is more convenient and time-efficient to use submit_skim_ntuples.py which will find all of the input files specified in data/inputs_config.json and run the core script by submitting the condor jobs.

  • The script is hard-coded currently, meaning one needs to directly modify the core script to change the variable calculation and the skimming cuts.
  • The output files (skimmed ntuples) will be saved to the folder named skimmed_ntuples by default.
  • submit_skim_ntuples.py will only submit the jobs for those files that don't exist in the output folder.
python scripts/submit_skim_ntuples.py

or

python scripts/skim_ntuples.py [-i input_file_path] [-o output_file_path]

Check the completeness of the jobs or recover the incomplete jobs

The script run_skim_ntuples.py will check if all of the outputs from last step exist and if the resulting number of events is correct.

python scripts/run_skim_ntuples.py [-c]

usage:
  -c   check the completeness only (without recovering the incomplete jobs)
  -s   skip checking the resulting number of events

Please be sure to do python run_skim_ntuples.py -c at least once before starting the BDT training exercise.

Start XGBoost analysis!

The whole ML task consists of training, applying the weights, optimizing the BDT boundaries for categorization, and calculating the number counting significances. The wrapper script run_all.sh will run everything. Please have a look!

Training a model

The training script train_bdt.py will train the model in four-fold, and transform the output scores such that the unweighted signal distribution is flat. The detailed settings, including the preselections, training variables, hyperparameters, etc, are specified in the config file data/training_config.json.

python scripts/train_bdt.py [-r TRAINED_MODEL] [-f FOLD] [--save]

Usage:
  -r, --region        The model to be trained. Choices: 'zero_jet', 'one_jet', 'two_jet' or 'VBF'.
  -f, --fold          Which fold to run. Default is -1 (run all folds)
  --save              To save the model into HDF5 files, and the pickle files

Applying the weights

Applying the trained model (as well as the score transformation) to the skimmed ntuples to get BDT scores for each event can be done by doing:

python scripts/apply_bdt.py [-r REGION]

The script will take the settings specified in the training config file data/training_config.json and the applying config file data/apply_config.json.

Optimizing the BDT boundaries

categorization_1D.py will take the Higgs classifier scores of the samples and optimize the boundaries that give the best combined significance. categorization_2D.py, on the other hand, takes both the Higgs classifier scores and the VBF classifier scores of the samples and optimizes the 2D boundaries that give the best combined significance.

python categorization_1D.py [-r REGION] [-f NUMBER OF FOLDS] [-b NUMBER OF CATEGORIES] [-n NSCAN] [--floatB] [--minN minN] [--skip]

Usage:
  -f, --fold          Number of folds of the categorization optimization. Default is 1.
  -b, --nbin          Number of BDT categories
  -n, --nscan         Number of scans. Default is 100
  --minN,             minN is the minimum number of events required in the mass window. The default is 5.
  --floatB            To float the last BDT boundary, which means to veto the lowest BDT score events
  --skip              To skip the hadd step (if you have already merged signal and background samples)
python categorization_2D.py [-r REGION] [-f NUMBER OF FOLDS] [-b NUMBER OF CATEGORIES] [-b NUMBER OF ggF CATEGORIES] [-n NSCAN] [--floatB] [--minN minN] [--skip]

Usage:
  -f, --fold          Number of folds of the categorization optimization. Default is 1.
  -b, --nbin          Number of BDT categories
  -n, --nscan         Number of scans. Default is 100
  --minN,             minN is the minimum number of events required in the mass window. The default is 5.
  --floatB            To float the last BDT boundary, which means to veto the lowest BDT score events
  --skip              To skip the hadd step (if you have already merged signal and background samples)

hmumuml's People

Stargazers

Jay Chan avatar

Watchers

James Cloos avatar Jay Chan avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.