
arogozhnikov / hep_ml

175 stars, 17 watchers, 64 forks, 94.3 MB

Machine Learning for High Energy Physics.

Home Page: https://arogozhnikov.github.io/hep_ml/

License: Other

Shell 0.05% Python 26.38% Jupyter Notebook 72.10% Makefile 0.75% CSS 0.02% Batchfile 0.70%
machine-learning high-energy-physics python splot neural-networks boosting-algorithms reweighting-algorithms scikit-learn


hep_ml's Issues

Negative sWeights

Hi,

I am trying to use the BoostingToUniformity notebook, in particular the uBoost classifier, and I am getting the error message 'the weights should be non-negative'. I tried removing this check from the source code and running uBoost without it, but then the 'predict' function returns an array of all zeros, and when I try to plot the ROC curve the output is NaNs. Is there a way of dealing with negative weights?

Many thanks,

Martha

Enhancements in API

Get rid of most parameters used when updating the regression tree.
There are too many parameters that are never used by any of the loss functions.
Also, the negative gradient can be removed from the API.

Leaves with no samples from original distribution

This issue was observed and reported by Jack Wimberley.

If there is a region with very few original samples, the decision tree can build a leaf containing samples only from the target distribution (more than min_samples_leaf of them) and exactly zero from the original.

As a result, the 'corrections' made by such a tree do not affect the training weights, but they blow up the weights on the test set.

Workarounds

Basically, almost anything from

  • increase min_samples_leaf
  • subsample=0.5
  • increase regularization (available in develop version)

(and any combination of the above) works well and resolves the problem in practice; a configuration sketch follows below.
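
For concreteness, a minimal sketch (my own, not part of the issue) of how these workarounds map onto GBReweighter's constructor; original and target stand for the usual pandas DataFrames, and loss_regularization assumes the develop version mentioned above:

    from hep_ml.reweight import GBReweighter

    # larger leaves + subsampling + regularization; any one of these alone often suffices
    reweighter = GBReweighter(n_estimators=50,
                              max_depth=3,
                              min_samples_leaf=1000,        # increased min_samples_leaf
                              gb_args={'subsample': 0.5},   # subsample=0.5
                              loss_regularization=5.0)      # regularization (develop version)
    reweighter.fit(original, target)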

Proper solution (not available now)

The good, correct solution would be to introduce a parameter 'minimal number of samples from the original distribution in a leaf', but this isn't supported by scikit-learn's decision trees (or by any other library).

List of gb_args options

Hi,
I'm using the Gradient Boosted Reweighter object GBReweighter from the hep_ml.reweight package.
I was wondering whether it would be possible to have a list of the options that can be passed to the GBReweighter via the "gb_args" dictionary.

Thanks!
Pietro
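
Not an authoritative list, but as far as I can tell gb_args is forwarded to the internal gradient-boosting regressor, so the keys that appear elsewhere on this page (subsample, max_features, random_state) are valid; a hedged sketch, with original and target being the usual pandas DataFrames:

    from hep_ml.reweight import GBReweighter

    reweighter = GBReweighter(n_estimators=40,
                              learning_rate=0.1,
                              max_depth=3,
                              min_samples_leaf=1000,
                              gb_args={'subsample': 0.5,     # random fraction of events per tree
                                       'random_state': 42})  # reproducible training
    reweighter.fit(original, target)

For the full set of accepted keys, the constructor of the gradient-boosting regressor in hep_ml.gradientboosting is the place to check.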

compile with -fPIC option

Hi, I'm trying to use the MLPClassifier from hep_ml.nnet. When running the code I get an error that suggests adding -fPIC during the Python compilation.
I tried adding the -fPIC option in the makefile and also changing CFLAGS=... to CFLAGS+=... as suggested in some other discussions, but both ways failed.
I attach in the zip both the error that I receive and the Makefile generated via ./configure. I built it with gcc48 on an SLC6 x86_64 system with an lxplus-like configuration.
Issues.zip
Does someone have an idea what I'm doing wrong?

uBoost Convergence

Hello,

How can I check the convergence of uBoost when using uniforming_rate (alpha) != 0? When I plot the log-loss metric vs the number of boosting iterations, I see that it increases, at a rate proportional to the alpha value used; you can see this trend in the attached plot. On the other hand, I can make the log-loss converge with another hyper-parameter configuration (for the same alpha), but then I don't get a uniform selection. How can I deal with this? Does it mean that the log-loss is not a good metric for checking convergence in this case?
uboost_vs_adaboost.pdf

Thanks very much,
Gino

uBoost de-correlation power

Hello,

I'm trying to run uBoost to get a background efficiency that is flat in mass. In particular, I want the efficiency to be flat at 8% background efficiency. To do this I used uBoostBDT and set 'target_efficiency'=0.08
and 'uniform_label'=0. I ran GridSearchCV to get the best hyper-parameters and trained with those for ~100 boosting iterations and with different 'uniform_rate' values, e.g. [0, 5, 10, 15, 20].
Looking at the background efficiency vs mass plots, I see that the profile at "bkg. eff. = 92%" gets gradually flatter as 'uniform_rate' increases, which is exactly the behaviour I want, but for the wrong profile! This made me suspect that to get a flat 8% background efficiency I need to set 'target_efficiency'=0.92.

Looking at the code, I see there is a sign flip in the classifier score:

https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L182
self.signed_uniform_label = 2 * self.uniform_label - 1

https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L243
signed_score = score * self.signed_uniform_label

So my interpretation of 'target_efficiency' is different for the two classes: when flattening signal, it is the amount of signal to keep; when flattening bkg, it is the amount of bkg to discard.

Is this reasoning correct?

Thanks in advance,
Gino
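
A tiny sketch that just spells out the two lines quoted above: with uniform_label=0 (flattening the background) the score is multiplied by -1, so the efficiency target effectively refers to the flipped ordering of the scores:

    # reproduces the sign convention quoted from uboost.py above
    for uniform_label in (0, 1):
        signed_uniform_label = 2 * uniform_label - 1       # -1 for background, +1 for signal
        score = 0.7                                        # hypothetical classifier score
        signed_score = score * signed_uniform_label
        print(uniform_label, signed_uniform_label, signed_score)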

Error propagation from weights

Hello @arogozhnikov ,

I am using the GBReweighter. Before the weights are applied, I can assume that my dataset is described by

{x1, x2...}

after the weights are applied the dataset is:

{(xi, wi) : i in [1, n]}

i.e. it depends on the weights. Therefore, if I initially had a function f(xi), that function is now f(xi, wi). The weights wi depend on our knowledge of the data (target) and simulation (original) distributions. However, we only have finite samples of these, so the weights should carry an uncertainty that lets us estimate the propagated error on f(xi, wi). Is there a way to estimate the uncertainty on these wi weights? And how are those uncertainties correlated? The correlations would also be needed to estimate the propagated error on f(xi, wi).

Cheers.
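
hep_ml does not provide weight uncertainties out of the box; one standard (if expensive) way to estimate them, including their correlations, is to bootstrap both finite samples and refit the reweighter. A minimal sketch, assuming original and target are the two pandas DataFrames:

    import numpy as np
    from hep_ml.reweight import GBReweighter

    n_boot = 50
    rng = np.random.RandomState(0)
    weight_replicas = []
    for _ in range(n_boot):
        # resample both finite samples with replacement and refit
        orig_boot = original.sample(len(original), replace=True, random_state=rng)
        targ_boot = target.sample(len(target), replace=True, random_state=rng)
        rw = GBReweighter(n_estimators=40, max_depth=3, min_samples_leaf=1000)
        rw.fit(orig_boot, targ_boot)
        # evaluate the weights on the full original sample each time
        weight_replicas.append(rw.predict_weights(original))

    weight_replicas = np.vstack(weight_replicas)   # shape (n_boot, n_events)
    w_err = weight_replicas.std(axis=0)            # per-event weight uncertainty
    # correlations are encoded in the replicas; for the error on f(xi, wi),
    # recompute f once per replica and take the spread of the results

Whether the bootstrap is the right error model for a given analysis is, of course, a separate question.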

Assertion Error with UGradientBoost

Hi I am wondering if you can help me.

I am getting the following error when trying to use UGradientBoost:

Traceback (most recent call last):
  File "uBoost_test.py", line 196, in <module>
    main()
  File "uBoost_test.py", line 30, in main
    train_classifier(dataframe, mode, year)
  File "uBoost_test.py", line 101, in train_classifier
    ugradientboost.fit(X_train, Y_train, w_train)
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/gradientboosting.py", line 205, in fit
    return UGradientBoostingBase.fit(self, X, y, sample_weight=sample_weight)
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/gradientboosting.py", line 131, in fit
    residual, weights = self.loss.prepare_tree_params(y_pred)
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/losses.py", line 118, in prepare_tree_params
    return self.negative_gradient(y_pred), numpy.ones(len(y_pred))
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/losses.py", line 753, in negative_gradient
    neg_gradient = self._compute_fl_derivatives(y_pred) * self.fl_coefficient
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/losses.py", line 748, in _compute_fl_derivatives
    assert numpy.all(neg_gradient[~numpy.in1d(self.y, self.uniform_label)] == 0)
AssertionError

Could you explain this last assert statement and why it is failing?

Many thanks.

search on hyper parameters

Hi,

I am using this package to reweight MC to look like sPlotted data, and I would like to scan the hyper-parameters to look for the best configuration.
scikit-learn tools are available for this (e.g. GridSearchCV or RandomizedSearchCV), but I am having trouble interfacing the two packages.
Has anyone done that? Are there alternative ways within hep_ml?

In particular, I have my pandas DataFrame for the original and target samples and I am trying something like

        GBreweighterPars = {"n_estimators"     : [10,500],
                            "learning_rate"    : [0.1, 1.0],
                            "max_depth"        : [1,5],
                            "min_samples_leaf" : [100,5000],
                            "subsample"        : [0.1, 1.0]}

        reweighter = reweight.GBReweighter(n_estimators     = GBreweighterPars["n_estimators"],
                                           learning_rate    = GBreweighterPars["learning_rate"],
                                           max_depth        = GBreweighterPars["max_depth"],
                                           min_samples_leaf = GBreweighterPars["min_samples_leaf"],
                                           gb_args          = {"subsample" : GBreweighterPars["subsample"]})

        gridSearch = GridSearchCV(reweighter, param_grid = GBreweighterPars)

        fit = gridSearch.fit(original, target)

but I get the following error

  File "mlWeight.py", line 273, in <module>
    rw = misc.reWeighter(ana, clSamples, inSamples, cutCL + " && " + cutEvt[i], cutMC + " && " + cutEvt[i], weightCL, weightMC, varsToMatch, varsToWatch, year, trigger, name + "_" + str(i), inName, useSW, search, test, add)
  File "/disk/moose/lhcb/simone/RD/Analysis/RKst/ml/misc.py", line 464, in reWeighter
    fit = gridSearch.fit(original, target)
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 940, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 539, in _fit
    self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/metrics/scorer.py", line 273, in check_scoring
    "have a 'score' method. The estimator %r does not." % estimator)
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator GBReweighter(gb_args={'subsample': [0.1, 1.0]}, learning_rate=[0.1, 1.0],
       max_depth=[1, 5], min_samples_leaf=[100, 5000],
       n_estimators=[10, 500]) does not.

However, I am not sure how to set the score method for GBReweighter

Any help/suggestions/examples would be much appreciated
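
A hedged sketch of one way around the missing score method (not an official hep_ml recipe): loop over ParameterGrid yourself and rate each configuration by how well a classifier can still separate the reweighted original sample from the target; an AUC close to 0.5 means good agreement. Here original and target are the DataFrames from the snippet above:

    import numpy as np
    from sklearn.model_selection import ParameterGrid, train_test_split
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score
    from hep_ml.reweight import GBReweighter

    param_grid = ParameterGrid({'n_estimators': [10, 50],
                                'learning_rate': [0.1, 0.3],
                                'max_depth': [2, 4],
                                'min_samples_leaf': [200, 1000]})

    orig_train, orig_test = train_test_split(original, random_state=42)
    targ_train, targ_test = train_test_split(target, random_state=42)

    results = []
    for params in param_grid:
        rw = GBReweighter(**params)
        rw.fit(orig_train, targ_train)
        weights = rw.predict_weights(orig_test)

        # classifier-based check: try to separate reweighted original from target
        data = np.concatenate([orig_test, targ_test])
        labels = np.concatenate([np.zeros(len(orig_test)), np.ones(len(targ_test))])
        sample_weight = np.concatenate([weights, np.ones(len(targ_test))])
        clf = GradientBoostingClassifier(n_estimators=50, max_depth=3)
        clf.fit(data, labels, sample_weight=sample_weight)
        auc = roc_auc_score(labels, clf.predict_proba(data)[:, 1], sample_weight=sample_weight)
        results.append((abs(auc - 0.5), params))   # ideally evaluate on a separate fold

    print(min(results, key=lambda r: r[0]))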

Add MAELossFunction

Mean absolute error (there will be some problems with predicting values in leaves), but it is still worth adding.
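
For reference, a minimal sketch (my own, not hep_ml's API) of the two ingredients an MAE loss needs; the optimal leaf value is a weighted median rather than a mean, which is where the 'problems with predicting values in leaves' come from:

    import numpy as np

    def mae_negative_gradient(y_true, y_pred):
        # d|y - pred|/dpred = -sign(y - pred), so the negative gradient is sign(y - pred)
        return np.sign(y_true - y_pred)

    def mae_leaf_value(residuals, sample_weight):
        # the constant c minimizing sum_i w_i * |r_i - c| is the weighted median of the residuals
        residuals = np.asarray(residuals, dtype=float)
        sample_weight = np.asarray(sample_weight, dtype=float)
        order = np.argsort(residuals)
        cumulative = np.cumsum(sample_weight[order])
        idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
        return residuals[order][idx]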

Saving uboost BDT with tf/keras base estimators

Hi,

I am trying to use a uBoost BDT to achieve uniform signal efficiency. My base estimator is a Keras model (TensorFlow 2.2), which I have written as a scikit-learn BaseEstimator subclass using tensorflow.keras.wrappers.scikit_learn.KerasClassifier. The training and everything seem to work fine, but I encounter an error when I try to save the uBoost classifier with pickle/joblib. The error is TypeError: can't pickle _thread.RLock objects
(the full error is at the bottom; it is mostly a long chain of calls to pickle).

From trying to look it up, it seems the error is usually to do with the way TensorFlow is run, but I'm only creating a simple model and fitting it, and all the session handling should be taken care of in this version of tf/keras. Maybe this answer is related: keras-team/keras#8343 (comment),
i.e. perhaps there is a call to something from the model that leaves an unserializable tensor object? As I am using the BDT and not the classifier, I assume it is not to do with any parallel processes either?

Please let me know if you know what is causing the issue or if there is some way I can work around it.

Thanks!

[attachment: screenshot of the full pickle traceback]

Exporting/saving/reusing the reweighting formula

Sometimes one would like to use a control sample (e.g. because it is more abundant) to determine MC weights that are then applied to other, rarer samples.

For this reason it would be very useful if hep_ml.reweight could export the "reweighting formula" in some format, e.g. ROOT, so that it can also be reused from different programming languages.

Thanks
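
There is no ROOT export, but since the trained reweighter is an ordinary Python object it can at least be persisted and reapplied to other samples from Python; a minimal sketch using pickle, where control_mc, control_data and rare_mc are hypothetical DataFrames:

    import pickle
    from hep_ml.reweight import GBReweighter

    # train on the abundant control sample
    reweighter = GBReweighter(n_estimators=40, max_depth=3, min_samples_leaf=1000)
    reweighter.fit(control_mc, control_data)
    with open('reweighter.pkl', 'wb') as f:
        pickle.dump(reweighter, f)

    # later, possibly in a different program: load and apply to a rarer sample
    # that contains the same variables
    with open('reweighter.pkl', 'rb') as f:
        reweighter = pickle.load(f)
    rare_weights = reweighter.predict_weights(rare_mc)

Exporting to ROOT or to another language would still require re-implementing the tree ensemble there, which is what this request is about.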

sklearn deprecation warning

Importing hep_ml raises a deprecation warning for sklearn:

$ python -c 'from hep_ml import splot'
/Users/apearce/.virtualenvs/foobar/lib/python2.7/site-packages/sklearn/cross_validation.py:44:
DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection
module into which all the refactored classes and functions are moved. Also note that the interface
of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Issue interfacing uboost classifiers with rep grid search: Exception an integer is required

Hi,

I have been trying to add uBoost to my grid search in REP and have encountered some difficulties.
I have made a minimal example of the error I get:
df = pandas.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df['E'] = 1
df['E'][3:] = 0
labels = df['E']
data = df.drop('E', axis=1)

uni_feats = 'C'
variables = ['A', 'B', 'D']

uboost_clf = uBoostClassifier(uniform_features=uni_feats, uniform_label=1,
                              train_features=variables)

grid_param = {}
grid_param['n_estimators'] = [50, 100, 125, 150]
grid_param['n_neighbors'] = [50, 51, 52, 53]

generator = RandomParameterOptimizer(grid_param, n_evaluations=2)
scorer = FoldingScorer(RocAuc(), folds=3, fold_checks=3)
estimator = SklearnClassifier(uboost_clf)
grid_finder = GridOptimalSearchCV(estimator, generator, scorer, parallel_profile='threads-4')
grid_finder.fit(data, labels)

This always results in the error:

Performing grid search in 4 threads
ERROR:rep.metaml.gridsearch:Fail during training on the node
Exception an integer is required
Parameters n_estimators=150, n_neighbors=52
ERROR:rep.metaml.gridsearch:Fail during training on the node
Exception an integer is required
Parameters n_estimators=125, n_neighbors=52
2 evaluations done

I have had a look, but I've had no luck finding the source of the exception, and I'm a bit puzzled as to what is causing it; the same code works for a number of other classifiers.

Is this just a case of something that is not supported by uBoost?

Any help or clarification here would be greatly appreciated,
Ryan

weight normalisation

I have used hep_ml in the past weeks to reweight MC distributions and stumbled upon the following issue.
When determining weights as the data/MC ratio of normalised distributions, the computed weights are normalised such that Sum w_i = N.
However, I noticed this is not the case for weights obtained using hep_ml.reweight.
Is this expected, or am I missing something?
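
If the Sum w_i = N convention is needed, the returned weights can simply be rescaled afterwards; a one-line sketch, with reweighter and original being the usual objects:

    weights = reweighter.predict_weights(original)
    weights *= len(weights) / weights.sum()   # enforce Sum w_i = N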

Question for multidimensional fit

Hi,

I am using your tool for reweighting MC to make it look like sPlotted data (thanks!).
I am currently following the examples and I am able to make it work in 1D, but I am not sure how to do it for more than one dimension. What I need is to find the MC weights when more than one variable shows disagreement between MC and data.

How do I do it?

Thanks!

Vicente Rives Molina

Strange values of bin edges

I found a very strange behaviour of the LookupClassifier's bin-edge calculation.
The issue occurs only for integer-type features.
Here are exemplary feature edges:
{"seed_nbIT", {0.0,0.0,0.0,}},
{"seed_nLayers", {11.0,12.0,12.0,}},
I attached the feature distribution plots.

Multidimensional reweighting

Hi,

I'm using hep_ml to perform a multidimensional reweighting of a MC sample and it is working really well. I have a question, to which I've not been able to find an answer in the paper (https://arxiv.org/pdf/1608.05806.pdf) or in the documentation (https://arogozhnikov.github.io/hep_ml/reweight.html).

The multidimensional space is split into large bins by optimising a symmetrised chi2. But are those bins multidimensional or 1-dimensional? In other words, is hep_ml reweighting 1D distributions iteratively, or is it reweighting multidimensional distributions directly?

Cheers,
Maxime

Nominal weights when correcting already weighted original

Hi, I'm trying to correct the distribution D in an original (MC) sample that already has some weights, say w_i, which correct something else (say Dp). The way I'm currently doing this is to obtain weights, say x_i, by calling predict_weights(original = D_array, original_weight = w).
My question is the following: once I've done this, do I have to use x_i or w_i * x_i as nominal weights for my MC (i.e. to have both D and Dp corrected)? If the answer is x_i, then very naively one could assume that the ratio of the two sets of corrections (x_i, w_i) would yield something that corrects Dp but not D. Is this assumption correct?

Cheers,
Dan

Using sWeights with GBReweighter

Hi,

I've noticed some issues with very large weights when using GBReweighter. I am trying to reweight data to look like some toy data I have. I have signal sWeights for the data, but not for the toy data.
I've trained the BDT using only the original_weight argument, not the target_weight argument. Following previous issues, I've tried to ensure that there is overlap between my data and toy distributions. I'm using 800k toy and 500k data events, which I think should be enough.
Do I need sWeight information for both datasets?
Without the sWeights the reweighting is much better; however, how should I account for background if I don't use sWeights?

Thanks

[attachment: withsWeightsTraining]

[attachment: withoutsWeightsTraining]

New release?

Hey,
what do you think about making a new release? The last release was over a year ago and some nice things have been added since (like the loss_regularization parameter etc.).

Moreover, it is very unhandy to specify the GitHub version as a pip dependency (above all as an optional dependency); a new pip version would be really great.

Binned chi^2 definition

I saw your presentation on “Reweighting distributions with gradient boosting” and it looked great, so I gave it a go. But now I want to explain it to others, so I have to actually understand how it works 😄

One thing I'm not certain of is how the node splitting is determined, i.e. how the value of the split of the training data along a feature axis at a node is chosen. You say this is the “symmetrized binned chi^2”, and I'd like to check that my understanding of what that means is correct. I have a notebook that tries to reproduce your plot. It looks similar, but I might have done something wrong nevertheless. Does it look sensible?

I tried to find where this computation is done in the code, but I couldn't find it. I'm not at all familiar with the general scikit-learn code architecture, so I have trouble following the flow of all the Xs and ys. Could you point me to where the chi^2 computation is done?

(And, of course, thanks for the excellent package! 🍻)
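
For what it's worth, my reading of the symmetrized binned chi^2 (an assumption based on the paper, not a quote of the implementation) is that each bin, or each side of a candidate split, contributes (w_original - w_target)^2 / (w_original + w_target), where the w's are the summed event weights of the two samples in that bin; a small sketch:

    import numpy as np

    def symmetrized_chi2(w_original, w_target):
        # w_original, w_target: summed event weights per bin (or per candidate leaf)
        w_original = np.asarray(w_original, dtype=float)
        w_target = np.asarray(w_target, dtype=float)
        denom = np.where(w_original + w_target > 0, w_original + w_target, 1.0)
        return np.sum((w_original - w_target) ** 2 / denom)

    # hypothetical example: the split with the larger value separates the samples better
    print(symmetrized_chi2([10.0, 30.0], [25.0, 15.0]))
    print(symmetrized_chi2([20.0, 20.0], [21.0, 19.0]))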

Odd behaviour of GBReweighter

I am trying to use GBReweighter and am getting odd behaviour of the weights as seen in the figures attached. My parameters are as follows:
reweighter = GBReweighter(n_estimators=40, learning_rate=0.1, max_depth=3, min_samples_leaf=1000, gb_args={'subsample': 0.4})
I have tried varying the parameters.
Do you know what might cause this behaviour?
Many thanks.
D02KSPiPiDD_2012_original.pdf
D02KSPiPiDD_2012_reweighted.pdf

Speedup with XGboost classifier runtime error

I have tried to use speedup.LookupClassifier with XGBoost as the base_estimator, but I failed. This may be a bug in the LookupClassifier implementation.
I executed the following Python code:

# imports added for completeness (assuming sklearn's train_test_split and hep_ml's LookupClassifier)
import xgboost as xgb
from sklearn.model_selection import train_test_split
from hep_ml.speedup import LookupClassifier

train_X, test_X, train_Y, test_Y = train_test_split(new_features, target, random_state=42, train_size=0.5)

base_classifier = xgb.XGBClassifier(n_estimators=400, learning_rate=0.07, scale_pos_weight=ratio_ghost_to_good)
classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
classifier.fit(train_X, train_Y)

and obtained the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-afeadcdc082e> in <module>()
      3 base_classifier = xgb.XGBClassifier(n_estimators=400, learning_rate=0.07 ,scale_pos_weight=ratio_ghost_to_good)
      4 classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
----> 5 classifier.fit(train_X, train_Y)

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/hep_ml/speedup.pyc in fit(self, X, y, sample_weight)
     91         all_lookup_indices = numpy.arange(int(n_parameter_combinations))
     92         all_combinations = self.convert_lookup_index_to_bins(all_lookup_indices)
---> 93         self._lookup_table = trained_estimator.predict_proba(all_combinations)
     94 
     95         if self.keep_trained_estimator:

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/sklearn.pyc in predict_proba(self, data, output_margin, ntree_limit)
    475         class_probs = self.booster().predict(test_dmatrix,
    476                                              output_margin=output_margin,
--> 477                                              ntree_limit=ntree_limit)
    478         if self.objective == "multi:softprob":
    479             return class_probs

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/core.pyc in predict(self, data, output_margin, ntree_limit, pred_leaf)
    937             option_mask |= 0x02
    938 
--> 939         self._validate_features(data)
    940 
    941         length = ctypes.c_ulong()

/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/core.pyc in _validate_features(self, data)
   1177 
   1178                 raise ValueError(msg.format(self.feature_names,
-> 1179                                             data.feature_names))
   1180 
   1181     def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):

ValueError: feature_names mismatch: [u'seed_chi2PerDoF', u'seed_p', u'seed_pt', u'seed_nLHCbIDs', u'seed_nbIT', u'seed_nLayers', u'seed_x', u'seed_y', u'seed_tx', u'seed_ty', u'abs_seed_x', u'abs_seed_y', u'abs_seed_tx', u'abs_seed_ty', u'seed_r', u'pseudo_rapidity'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15']
expected seed_nbIT, abs_seed_y, abs_seed_x, seed_tx, seed_pt, seed_nLayers, seed_x, seed_y, seed_ty, pseudo_rapidity, seed_p, seed_r, abs_seed_tx, abs_seed_ty, seed_nLHCbIDs, seed_chi2PerDoF in input data
training data did not have the following fields: f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f12, f13, f10, f11, f14, f15
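
A hedged reading of the traceback: LookupClassifier fits the base estimator on a pandas DataFrame (so XGBoost records the column names), but then calls predict_proba on a plain numpy array of all bin combinations, and XGBoost refuses the feature_names mismatch. One possible workaround (an assumption, not a verified fix) is to rename the columns to XGBoost's default f0, f1, ... names before fitting, so that both calls agree:

    # hypothetical workaround: give the columns XGBoost's default names (f0, f1, ...)
    train_X_renamed = train_X.copy()
    train_X_renamed.columns = ['f%d' % i for i in range(train_X_renamed.shape[1])]

    classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
    classifier.fit(train_X_renamed, train_Y)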

sPlot returns NAN sWeights

I am currently trying to use hep_ml's splot and am getting some sWeights as NaN. I am using a sample of ~1.4M events, and this seems to be happening after event ~200k. I have checked the signal and background probabilities and these look reasonable. I have also checked the sWeights before event ~200k and these also look reasonable. I have checked the sWeighted signal and background distributions for a relevant pT variable and these also look OK.

So I am wondering: is there some reason they would not be calculated correctly after a certain event? Any help would be much appreciated.

image
image

image

Random behavior of GBReweighter and UGradientBoostingClassifier

(Leaving this as an open answer to a common question)

Why do GBReweighter/UGradientBoostingClassifier give different weights after each training?

Both algorithms are based on stochastic tree boosting. Settings like subsample and max_features lead to randomized tree building (i.e. each tree uses only a random part of the training data), which is widely known to strengthen an ensemble by building more diverse trees.

hep_ml follows the sklearn convention of keeping random things random unless explicitly asked otherwise.

Reproducible behavior is achieved by setting random_state:

For boosting:
    UGradientBoostingClassifier(<other setting here>, random_state=42)

For the reweighter:
    GBReweighter(<other setting here>, gb_args={'random_state': 42, <other gb args>})
