
arogozhnikov / hep_ml

175 stars, 17 watchers, 64 forks, 94.3 MB

Machine Learning for High Energy Physics.

Home Page: https://arogozhnikov.github.io/hep_ml/

License: Other

Shell 0.05% Python 26.38% Jupyter Notebook 72.10% Makefile 0.75% CSS 0.02% Batchfile 0.70%
machine-learning high-energy-physics python splot neural-networks boosting-algorithms reweighting-algorithms scikit-learn


hep_ml's Issues

Negative sWeights

Hi,

I am trying to use the BoostingToUniformity notebook, in particular the uBoost classifier, and I am getting the error message 'the weights should be non-negative'. I tried removing this check from the source code and running uBoost without it, but then the 'predict' function returns an array of all zeros, and when I try to plot the ROC curve the output is NaNs. Is there a way of dealing with negative weights?

Many thanks,

Martha

Enhancements in API

Get rid of most parameters used when updating the regression tree.
There are too many parameters that are never used by any of the loss functions.
Also, the negative gradient can be removed from the API.

Leaves with no samples from original distribution

This issue was observed and reported by Jack Wimberley.

If there is a region with very few original samples, the decision tree can build a leaf containing samples only from the target distribution (more than min_samples_leaf of them) and exactly zero from the original.

As a result, the 'corrections' made by such a tree do not affect the training weights, but they blow up the weights on the test set.

Workarounds

Basically, almost anything from

  • increase min_samples_leaf
  • subsample=0.5
  • increase regularization (available in develop version)

(and any combination of the above) works well and resolves the problem in practice; a configuration sketch follows below.
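
For concreteness, a minimal sketch (my own, not part of the issue) of how these workarounds map onto GBReweighter's constructor; original and target stand for the usual pandas DataFrames, and loss_regularization assumes the develop version mentioned above:

    from hep_ml.reweight import GBReweighter

    # larger leaves + subsampling + regularization; any one of these alone often suffices
    reweighter = GBReweighter(n_estimators=50,
                              max_depth=3,
                              min_samples_leaf=1000,        # increased min_samples_leaf
                              gb_args={'subsample': 0.5},   # subsample=0.5
                              loss_regularization=5.0)      # regularization (develop version)
    reweighter.fit(original, target)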

Proper solution (not available now)

The good, correct solution would be to introduce a parameter 'minimal number of samples from the original distribution in a leaf', but this isn't supported by scikit-learn's decision trees (or by any other library).

List of gb_args options

Hi,
I'm using the Gradient Boosted Reweighter object GBReweighter from the hep_ml.reweight package.
I was wondering whether it would be possible to have a list of the options that can be passed to the GBReweighter via the "gb_args" dictionary.

Thanks!
Pietro
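
Not an authoritative list, but as far as I can tell gb_args is forwarded to the internal gradient-boosting regressor, so the keys that appear elsewhere on this page (subsample, max_features, random_state) are valid; a hedged sketch, with original and target being the usual pandas DataFrames:

    from hep_ml.reweight import GBReweighter

    reweighter = GBReweighter(n_estimators=40,
                              learning_rate=0.1,
                              max_depth=3,
                              min_samples_leaf=1000,
                              gb_args={'subsample': 0.5,     # random fraction of events per tree
                                       'random_state': 42})  # reproducible training
    reweighter.fit(original, target)

For the full set of accepted keys, the constructor of the gradient-boosting regressor in hep_ml.gradientboosting is the place to check.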

compile with -fPIC option

Hi, I'm trying to use the MLPClassifier from hep_ml.nnet. When running the code I get an error that suggests adding -fPIC during the Python compilation.
I tried adding the -fPIC option in the makefile and also changing CFLAGS=... to CFLAGS+=... as suggested in some other discussions, but both ways failed.
I attach in the zip both the error that I receive and the Makefile generated via ./configure. I built it with gcc48 on an SLC6 x86_64 system with an lxplus-like configuration.
Issues.zip
Does someone have an idea what I'm doing wrong?

uBoost Convergence

Hello,

How can I check the convergence of uBoost when using uniforming_rate (alpha) != 0? When I plot the log-loss metric vs the number of boosting iterations, I see that it increases, at a rate proportional to the alpha value used; you can see this trend in the attached plot. On the other hand, I can make the log-loss converge with another hyper-parameter configuration (for the same alpha), but then I don't get a uniform selection. How can I deal with this? Does it mean that the log-loss is not a good metric for checking convergence in this case?
uboost_vs_adaboost.pdf

Thanks very much,
Gino

uBoost de-correlation power

Hello,

I'm trying to run uBoost to get a background efficiency that is flat in mass. In particular, I want the efficiency to be flat at 8% background efficiency. To do this I used uBoostBDT and set 'target_efficiency'=0.08
and 'uniform_label'=0. I ran GridSearchCV to get the best hyper-parameters and trained with those for ~100 boosting iterations and with different 'uniform_rate' values, e.g. [0, 5, 10, 15, 20].
Looking at the background efficiency vs mass plots, I see that the profile at "bkg. eff. = 92%" gets gradually flatter as 'uniform_rate' increases, which is exactly the behaviour I want, but for the wrong profile! This made me suspect that to get a flat 8% background efficiency I need to set 'target_efficiency'=0.92.

Looking at the code, I see there is a sign flip in the classifier score:

https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L182
self.signed_uniform_label = 2 * self.uniform_label - 1

https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L243
signed_score = score * self.signed_uniform_label

So my interpretation of 'target_efficiency' is different for the two classes: when flattening signal, it is the amount of signal to keep; when flattening bkg, it is the amount of bkg to discard.

Is this reasoning correct?

Thanks in advance,
Gino
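
A tiny sketch that just spells out the two lines quoted above: with uniform_label=0 (flattening the background) the score is multiplied by -1, so the efficiency target effectively refers to the flipped ordering of the scores:

    # reproduces the sign convention quoted from uboost.py above
    for uniform_label in (0, 1):
        signed_uniform_label = 2 * uniform_label - 1       # -1 for background, +1 for signal
        score = 0.7                                        # hypothetical classifier score
        signed_score = score * signed_uniform_label
        print(uniform_label, signed_uniform_label, signed_score)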

Error propagation from weights

Hello @arogozhnikov ,

I am using the GBReweighter. Before the weights are applied, I can assume that my dataset is described by

{x1, x2...}

after the weights are applied the dataset is:

{(xi, wi) : i in [1, n]}

i.e. it depends on the weights. Therefore, if I initially had a function f(xi), that function is now f(xi, wi). The weights wi depend on our knowledge of the data (target) and simulation (original) distributions. However, we only have finite samples of these, so the weights should carry an uncertainty that lets us estimate the propagated error on f(xi, wi). Is there a way to estimate the uncertainty on these wi weights? And how are those uncertainties correlated? The correlations would also be needed to estimate the propagated error on f(xi, wi).

Cheers.
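
hep_ml does not provide weight uncertainties out of the box; one standard (if expensive) way to estimate them, including their correlations, is to bootstrap both finite samples and refit the reweighter. A minimal sketch, assuming original and target are the two pandas DataFrames:

    import numpy as np
    from hep_ml.reweight import GBReweighter

    n_boot = 50
    rng = np.random.RandomState(0)
    weight_replicas = []
    for _ in range(n_boot):
        # resample both finite samples with replacement and refit
        orig_boot = original.sample(len(original), replace=True, random_state=rng)
        targ_boot = target.sample(len(target), replace=True, random_state=rng)
        rw = GBReweighter(n_estimators=40, max_depth=3, min_samples_leaf=1000)
        rw.fit(orig_boot, targ_boot)
        # evaluate the weights on the full original sample each time
        weight_replicas.append(rw.predict_weights(original))

    weight_replicas = np.vstack(weight_replicas)   # shape (n_boot, n_events)
    w_err = weight_replicas.std(axis=0)            # per-event weight uncertainty
    # correlations are encoded in the replicas; for the error on f(xi, wi),
    # recompute f once per replica and take the spread of the results

Whether the bootstrap is the right error model for a given analysis is, of course, a separate question.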

Assertion Error with UGradientBoost

Hi I am wondering if you can help me.

I am getting the following error when trying to use UGradientBoost:

Traceback (most recent call last):
  File "uBoost_test.py", line 196, in <module>
    main()
  File "uBoost_test.py", line 30, in main
    train_classifier(dataframe, mode, year)
  File "uBoost_test.py", line 101, in train_classifier
    ugradientboost.fit(X_train, Y_train, w_train)
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/gradientboosting.py", line 205, in fit
    return UGradientBoostingBase.fit(self, X, y, sample_weight=sample_weight)
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/gradientboosting.py", line 131, in fit
    residual, weights = self.loss.prepare_tree_params(y_pred)
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/losses.py", line 118, in prepare_tree_params
    return self.negative_gradient(y_pred), numpy.ones(len(y_pred))
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/losses.py", line 753, in negative_gradient
    neg_gradient = self._compute_fl_derivatives(y_pred) * self.fl_coefficient
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/losses.py", line 748, in _compute_fl_derivatives
    assert numpy.all(neg_gradient[~numpy.in1d(self.y, self.uniform_label)] == 0)
AssertionError

Could you explain this last assert statement and why it is failing?

Many thanks.

search on hyper parameters

Hi,

I am using this package to reweight MC to look like sPlotted data, and I would like to scan the hyper-parameters to look for the best configuration.
scikit-learn tools are available for this (e.g. GridSearchCV or RandomizedSearchCV), but I am having trouble interfacing the two packages.
Has anyone done that? Are there alternative ways within hep_ml?

In particular, I have my pandas DataFrame for the original and target samples and I am trying something like

        GBreweighterPars = {"n_estimators"     : [10,500],
                            "learning_rate"    : [0.1, 1.0],
                            "max_depth"        : [1,5],
                            "min_samples_leaf" : [100,5000],
                            "subsample"        : [0.1, 1.0]}

        reweighter = reweight.GBReweighter(n_estimators     = GBreweighterPars["n_estimators"],
                                           learning_rate    = GBreweighterPars["learning_rate"],
                                           max_depth        = GBreweighterPars["max_depth"],
                                           min_samples_leaf = GBreweighterPars["min_samples_leaf"],
                                           gb_args          = {"subsample" : GBreweighterPars["subsample"]})

        gridSearch = GridSearchCV(reweighter, param_grid = GBreweighterPars)

        fit = gridSearch.fit(original, target)

but I get the following error

  File "mlWeight.py", line 273, in <module>
    rw = misc.reWeighter(ana, clSamples, inSamples, cutCL + " && " + cutEvt[i], cutMC + " && " + cutEvt[i], weightCL, weightMC, varsToMatch, varsToWatch, year, trigger, name + "_" + str(i), inName, useSW, search, test, add)
  File "/disk/moose/lhcb/simone/RD/Analysis/RKst/ml/misc.py", line 464, in reWeighter
    fit = gridSearch.fit(original, target)
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 940, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 539, in _fit
    self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/metrics/scorer.py", line 273, in check_scoring
    "have a 'score' method. The estimator %r does not." % estimator)
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator GBReweighter(gb_args={'subsample': [0.1, 1.0]}, learning_rate=[0.1, 1.0],
       max_depth=[1, 5], min_samples_leaf=[100, 5000],
       n_estimators=[10, 500]) does not.

However, I am not sure how to set the score method for GBReweighter

Any help/suggestions/examples would be much appreciated
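
A hedged sketch of one way around the missing score method (not an official hep_ml recipe): loop over ParameterGrid yourself and rate each configuration by how well a classifier can still separate the reweighted original sample from the target; an AUC close to 0.5 means good agreement. Here original and target are the DataFrames from the snippet above:

    import numpy as np
    from sklearn.model_selection import ParameterGrid, train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from hep_ml.reweight import GBReweighter

    param_grid = ParameterGrid({'n_estimators': [10, 50],
                                'learning_rate': [0.1, 0.3],
                                'max_depth': [2, 4],
                                'min_samples_leaf': [200, 1000]})

    orig_train, orig_test = train_test_split(original, random_state=42)
    targ_train, targ_test = train_test_split(target, random_state=42)

    results = []
    for params in param_grid:
        rw = GBReweighter(**params)
        rw.fit(orig_train, targ_train)
        weights = rw.predict_weights(orig_test)

        # classifier-based check: try to separate reweighted original from target
        data = np.concatenate([orig_test, targ_test])
        labels = np.concatenate([np.zeros(len(orig_test)), np.ones(len(targ_test))])
        sample_weight = np.concatenate([weights, np.ones(len(targ_test))])
        clf = GradientBoostingClassifier(n_estimators=50, max_depth=3)
        clf.fit(data, labels, sample_weight=sample_weight)
        auc = roc_auc_score(labels, clf.predict_proba(data)[:, 1], sample_weight=sample_weight)
        results.append((abs(auc - 0.5), params))   # ideally evaluate on a separate fold

    print(min(results, key=lambda r: r[0]))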

Add MAELossFunction

Mean absolute error (there will be some problems with predicting values in leaves), but it is still worth adding.
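
For reference, a minimal sketch (my own, not hep_ml's API) of the two ingredients an MAE loss needs; the optimal leaf value is a weighted median rather than a mean, which is where the 'problems with predicting values in leaves' come from:

    import numpy as np

    def mae_negative_gradient(y_true, y_pred):
        # d|y - pred|/dpred = -sign(y - pred), so the negative gradient is sign(y - pred)
        return np.sign(y_true - y_pred)

    def mae_leaf_value(residuals, sample_weight):
        # the constant c minimizing sum_i w_i * |r_i - c| is the weighted median of the residuals
        residuals = np.asarray(residuals, dtype=float)
        sample_weight = np.asarray(sample_weight, dtype=float)
        order = np.argsort(residuals)
        cumulative = np.cumsum(sample_weight[order])
        idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
        return residuals[order][idx]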

Saving uboost BDT with tf/keras base estimators

Hi,

I am trying to use a uBoost BDT to achieve uniform signal efficiency. My base estimator is a Keras model (TensorFlow 2.2), which I have written as a scikit-learn BaseEstimator subclass using tensorflow.keras.wrappers.scikit_learn.KerasClassifier. The training and everything seem to work fine, but I encounter an error when I try to save the uBoost classifier with pickle/joblib. The error is TypeError: can't pickle _thread.RLock objects
(the full error is at the bottom; it is mostly a long chain of calls to pickle).

From trying to look it up, it seems the error is usually to do with the way TensorFlow is run, but I'm only creating a simple model and fitting it, and all the session handling should be taken care of in this version of tf/keras. Maybe this answer is related: keras-team/keras#8343 (comment),
i.e. perhaps there is a call to something from the model that leaves an unserializable tensor object? As I am using the BDT and not the classifier, I assume it is not to do with any parallel processes either?

Please let me know if you know what is causing the issue or if there is some way I can work around it.

Thanks!

[attachment: screenshot of the full pickle traceback]

Exporting/saving/reusing the reweighting formula

Sometimes one would like to use a control sample (e.g. because it is more abundant) to determine MC weights that are then applied to other, rarer samples.

For this reason it would be very useful if hep_ml.reweight could export the "reweighting formula" in some format, e.g. ROOT, so that it can also be reused from different programming languages.

Thanks
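
There is no ROOT export, but since the trained reweighter is an ordinary Python object it can at least be persisted and reapplied to other samples from Python; a minimal sketch using pickle, where control_mc, control_data and rare_mc are hypothetical DataFrames:

    import pickle
    from hep_ml.reweight import GBReweighter

    # train on the abundant control sample
    reweighter = GBReweighter(n_estimators=40, max_depth=3, min_samples_leaf=1000)
    reweighter.fit(control_mc, control_data)
    with open('reweighter.pkl', 'wb') as f:
        pickle.dump(reweighter, f)

    # later, possibly in a different program: load and apply to a rarer sample
    # that contains the same variables
    with open('reweighter.pkl', 'rb') as f:
        reweighter = pickle.load(f)
    rare_weights = reweighter.predict_weights(rare_mc)

Exporting to ROOT or to another language would still require re-implementing the tree ensemble there, which is what this request is about.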

sklearn deprecation warning

Importing hep_ml raises a deprecation warning for sklearn:

$ python -c 'from hep_ml import splot'
/Users/apearce/.virtualenvs/foobar/lib/python2.7/site-packages/sklearn/cross_validation.py:44:
DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection
module into which all the refactored classes and functions are moved. Also note that the interface
of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Issue interfacing uboost classifiers with rep grid search: Exception an integer is required

Hi,

I have been trying to add uBoost to my grid search in REP and have encountered some difficulties.
I have made a minimal example of the error I get:
df = pandas.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df['E'] = 1
df['E'][3:] = 0
labels = df['E']
data = df.drop('E', axis=1)

uni_feats = 'C'
variables = ['A', 'B', 'D']

uboost_clf = uBoostClassifier(uniform_features=uni_feats, uniform_label=1,
                              train_features=variables)

grid_param = {}
grid_param['n_estimators'] = [50, 100, 125, 150]
grid_param['n_neighbors'] = [50, 51, 52, 53]

generator = RandomParameterOptimizer(grid_param, n_evaluations=2)
scorer = FoldingScorer(RocAuc(), folds=3, fold_checks=3)
estimator = SklearnClassifier(uboost_clf)
grid_finder = GridOptimalSearchCV(estimator, generator, scorer, parallel_profile='threads-4')
grid_finder.fit(data, labels)

This always results in the error:

Performing grid search in 4 threads
ERROR:rep.metaml.gridsearch:Fail during training on the node
Exception an integer is required
Parameters n_estimators=150, n_neighbors=52
ERROR:rep.metaml.gridsearch:Fail during training on the node
Exception an integer is required
Parameters n_estimators=125, n_neighbors=52
2 evaluations done

I have had a look, but I've had no luck finding the source of the exception, and I'm a bit puzzled as to what is causing it; the same code works for a number of other classifiers.

Is this just a case of something that is not supported by uBoost?

Any help or clarification here would be greatly appreciated,
Ryan

weight normalisation

I have used hep_ml in the past weeks to reweight MC distributions and stumbled upon the following issue.
When determining weights as the data/MC ratio of normalised distributions, the computed weights are normalised such that Sum w_i = N.
However, I noticed this is not the case for weights obtained using hep_ml.reweight.
Is this expected, or am I missing something?
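
If the Sum w_i = N convention is needed, the returned weights can simply be rescaled afterwards; a one-line sketch, with reweighter and original being the usual objects:

    weights = reweighter.predict_weights(original)
    weights *= len(weights) / weights.sum()   # enforce Sum w_i = N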

Question for multidimensional fit

Hi,

I am using your tool for reweighting MC to make it look like sPlotted data (thanks!).
I am currently following the examples and I am able to make it work in 1D, but I am not sure how to do it for more than one dimension. What I need is to find the MC weights when more than one variable shows disagreement between MC and data.

How do I do it?

Thanks!

Vicente Rives Molina

Strange values of bin edges

I found a very strange behaviour of the LookupClassifier's bin-edge calculation.
The issue occurs only for integer-type features.
Here are exemplary feature edges:
{"seed_nbIT", {0.0,0.0,0.0,}},
{"seed_nLayers", {11.0,12.0,12.0,}},
I attached the feature distribution plots.

Multidimensional reweighting

Hi,

I'm using hep_ml to perform a multidimensional reweighting of a MC sample and it is working really well. I have a question, to which I've not been able to find an answer in the paper (https://arxiv.org/pdf/1608.05806.pdf) or in the documentation (https://arogozhnikov.github.io/hep_ml/reweight.html).

The multidimensional space is split into large bins by optimising a symmetrised chi2. But are those bins multidimensional or 1-dimensional? In other words, is hep_ml reweighting 1D distributions iteratively, or is it reweighting multidimensional distributions directly?

Cheers,
Maxime

Nominal weights when correcting already weighted original

Hi, I'm trying to correct the distribution D in an original (MC) sample that already has some weights, say w_i, which correct something else (say Dp). The way I'm currently doing this is to obtain weights, say x_i, by calling predict_weights(original = D_array, original_weight = w).
My question is the following: once I've done this, do I have to use x_i or w_i * x_i as nominal weights for my MC (i.e. to have both D and Dp corrected)? If the answer is x_i, then very naively one could assume that the ratio of the two sets of corrections (x_i, w_i) would yield something that corrects Dp but not D. Is this assumption correct?

Cheers,
Dan

Using sWeights with GBReweighter

Hi,

I've noticed some issues with very large weights when using GBReweighter. I am trying to reweight data to look like some toy data I have. I have signal sWeights for the data, but not for the toy data.
I've trained the BDT using only the original_weight argument, not the target_weight argument. Following previous issues, I've tried to ensure that there is overlap between my data and toy distributions. I'm using 800k toy and 500k data events, which I think should be enough.
Do I need sWeight information for both datasets?
Without the sWeights the reweighting is much better; however, how should I account for background if I don't use sWeights?

Thanks

[attachment: withsWeightsTraining]

[attachment: withoutsWeightsTraining]

New release?

Hey,
what do you think about making a new release? The last release was over a year ago and some nice things have been added since (like the loss_regularization parameter etc.).

Moreover, it is very unhandy to specify the GitHub version as a pip dependency (above all as an optional dependency); a new pip version would be really great.

Binned chi^2 definition

I saw your presentation on “Reweighting distributions with gradient boosting” and it looked great, so I gave it a go. But now I want to explain it to others, so I have to actually understand how it works 😄

One thing I'm not certain of is how the node splitting is determined, i.e. how the value of the split of the training data along a feature axis at a node is chosen. You say this is the “symmetrized binned chi^2”, and I'd like to check that my understanding of what that means is correct. I have a notebook that tries to reproduce your plot. It looks similar, but I might have done something wrong nevertheless. Does it look sensible?

I tried to find where this computation is done in the code, but I couldn't find it. I'm not at all familiar with the general scikit-learn code architecture, so I have trouble following the flow of all the Xs and ys. Could you point me to where the chi^2 computation is done?

(And, of course, thanks for the excellent package! 🍻)
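
For what it's worth, my reading of the symmetrized binned chi^2 (an assumption based on the paper, not a quote of the implementation) is that each bin, or each side of a candidate split, contributes (w_original - w_target)^2 / (w_original + w_target), where the w's are the summed event weights of the two samples in that bin; a small sketch:

    import numpy as np

    def symmetrized_chi2(w_original, w_target):
        # w_original, w_target: summed event weights per bin (or per candidate leaf)
        w_original = np.asarray(w_original, dtype=float)
        w_target = np.asarray(w_target, dtype=float)
        denom = np.where(w_original + w_target > 0, w_original + w_target, 1.0)
        return np.sum((w_original - w_target) ** 2 / denom)

    # hypothetical example: the split with the larger value separates the samples better
    print(symmetrized_chi2([10.0, 30.0], [25.0, 15.0]))
    print(symmetrized_chi2([20.0, 20.0], [21.0, 19.0]))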

Odd behaviour of GBReweighter

I am trying to use GBReweighter and am getting odd behaviour of the weights as seen in the figures attached. My parameters are as follows:
reweighter = GBReweighter(n_estimators=40, learning_rate=0.1, max_depth=3, min_samples_leaf=1000, gb_args={'subsample': 0.4})
I have tried varying the parameters.
Do you know what might cause this behaviour?
Many thanks.
D02KSPiPiDD_2012_original.pdf
D02KSPiPiDD_2012_reweighted.pdf

Speedup with XGboost classifier runtime error

I have tried to use speedup.LookupClassifier with XGBoost as the base_estimator, but I failed. This may be a bug in the LookupClassifier implementation.
I executed the following Python code:

# imports added for completeness (assuming sklearn's train_test_split and hep_ml's LookupClassifier)
import xgboost as xgb
from sklearn.model_selection import train_test_split
from hep_ml.speedup import LookupClassifier

train_X, test_X, train_Y, test_Y = train_test_split(new_features, target, random_state=42, train_size=0.5)

base_classifier = xgb.XGBClassifier(n_estimators=400, learning_rate=0.07, scale_pos_weight=ratio_ghost_to_good)
classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
classifier.fit(train_X, train_Y)

and obtained the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-afeadcdc082e> in <module>()
      3 base_classifier = xgb.XGBClassifier(n_estimators=400, learning_rate=0.07 ,scale_pos_weight=ratio_ghost_to_good)
      4 classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
----> 5 classifier.fit(train_X, train_Y)

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/hep_ml/speedup.pyc in fit(self, X, y, sample_weight)
     91         all_lookup_indices = numpy.arange(int(n_parameter_combinations))
     92         all_combinations = self.convert_lookup_index_to_bins(all_lookup_indices)
---> 93         self._lookup_table = trained_estimator.predict_proba(all_combinations)
     94 
     95         if self.keep_trained_estimator:

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/sklearn.pyc in predict_proba(self, data, output_margin, ntree_limit)
    475         class_probs = self.booster().predict(test_dmatrix,
    476                                              output_margin=output_margin,
--> 477                                              ntree_limit=ntree_limit)
    478         if self.objective == "multi:softprob":
    479             return class_probs

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/core.pyc in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/core.pyc in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: [u'seed_chi2PerDoF', u'seed_p', u'seed_pt', u'seed_nLHCbIDs', u'seed_nbIT', u'seed_nLayers', u'seed_x', u'seed_y', u'seed_tx', u'seed_ty', u'abs_seed_x', u'abs_seed_y', u'abs_seed_tx', u'abs_seed_ty', u'seed_r', u'pseudo_rapidity'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15']
expected seed_nbIT, abs_seed_y, abs_seed_x, seed_tx, seed_pt, seed_nLayers, seed_x, seed_y, seed_ty, pseudo_rapidity, seed_p, seed_r, abs_seed_tx, abs_seed_ty, seed_nLHCbIDs, seed_chi2PerDoF in input data
training data did not have the following fields: f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f12, f13, f10, f11, f14, f15
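
A hedged reading of the traceback: LookupClassifier fits the base estimator on a pandas DataFrame (so XGBoost records the column names), but then calls predict_proba on a plain numpy array of all bin combinations, and XGBoost refuses the feature_names mismatch. One possible workaround (an assumption, not a verified fix) is to rename the columns to XGBoost's default f0, f1, ... names before fitting, so that both calls agree:

    # hypothetical workaround: give the columns XGBoost's default names (f0, f1, ...)
    train_X_renamed = train_X.copy()
    train_X_renamed.columns = ['f%d' % i for i in range(train_X_renamed.shape[1])]

    classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
    classifier.fit(train_X_renamed, train_Y)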

sPlot returns NAN sWeights

I am currently trying to use hep_ml's splot and am getting some sWeights as NaN. I am using a sample of ~1.4M events, and this seems to be happening after event ~200k. I have checked the signal and background probabilities and these look reasonable. I have also checked the sWeights before event ~200k and these also look reasonable. I have checked the sWeighted signal and background distributions for a relevant pT variable and these also look OK.

So I am wondering: is there some reason they would not be calculated correctly after a certain event? Any help would be much appreciated.

image
image

image

Random behavior of GBReweighter and UGradientBoostingClassifier

(Leaving this as an open answer to a common question)

Why do GBReweighter/UGradientBoostingClassifier give different weights after each training?

Both algorithms are based on stochastic tree boosting. Settings like subsample and max_features lead to randomized tree building (i.e. each tree uses only a random part of the training data), which is widely known to strengthen an ensemble by building more diverse trees.

hep_ml follows the sklearn convention of keeping random things random unless explicitly asked otherwise.

Reproducible behavior is achieved by setting random_state:

For boosting:
    UGradientBoostingClassifier(<other setting here>, random_state=42)

For the reweighter:
    GBReweighter(<other setting here>, gb_args={'random_state': 42, <other gb args>})
