arogozhnikov / hep_ml
Machine Learning for High Energy Physics.
Home Page: https://arogozhnikov.github.io/hep_ml/
License: Other
Currently regression lacks an initial constant, which may be resolved, e.g., by an iterative auto algorithm. Constants for losses are substituted by reweighting.
Hi,
I am trying to use the BoostingToUniformity notebook, in particular the uBoost classifier. I am getting the error message 'the weights should be non-negative'. I have tried removing this from the source code and tried to run uBoost without this line. When I use the 'predict' function I get an array of all zeros and when I try to plot the ROC curve I get nans as the output. I am wondering if there is a way of dealing with negative weights?
Many thanks,
Martha
Things to note
Get rid of most parameters in updating the regression tree.
There are too many parameters that are never used by any of the loss functions.
Also, we can remove the negative gradient from the API.
This issue was observed and reported by Jack Wimberley.
If there is a region with very few original samples, a decision tree can build a leaf containing samples only from the target distribution (> min_samples_leaf) and exactly zero from the original.
As a result, 'corrections' made by a tree do not affect the training weights, but the weights blow up on the test sample.
Basically, almost anything from
(and any combination of the above) works well and resolves the problem in practice.
A good, correct solution would be to introduce a parameter 'minimal number of samples from the original distribution in a leaf', but this isn't supported by the decision trees of scikit-learn (or any other library).
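The blow-up described above can be illustrated numerically. The sketch below assumes that each leaf applies a correction of the form log(w_target / w_original), optionally with a small regularization constant added to both sums. The function name `leaf_log_correction` and the numbers are illustrative only and are not taken from the hep_ml source.

```python
import numpy as np

def leaf_log_correction(w_target, w_original, regularization=0.0):
    """Log-ratio of summed target and original weights in one leaf (assumed form)."""
    with np.errstate(divide="ignore"):
        return np.log(w_target + regularization) - np.log(w_original + regularization)

# A healthy leaf: both samples are populated.
healthy = leaf_log_correction(w_target=120.0, w_original=100.0)

# A pathological leaf: plenty of target events, exactly zero original events.
unregularized = leaf_log_correction(w_target=120.0, w_original=0.0)
regularized = leaf_log_correction(w_target=120.0, w_original=0.0, regularization=5.0)

print(healthy, unregularized, regularized)
```

No original training event lands in the pathological leaf, so the training weights are untouched, while any test event that does land there is multiplied by the exponential of the correction. Under this assumed form, a regularization term keeps the correction finite.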
Hi,
I'm using the Gradient Boosted Reweighter object GBReweighter from the hep_ml.reweight package.
I was wondering if it could be possible to have the list of the possible options that can be passed to the GBReweighter via the "gb_args" dictionary.
Thanks!
Pietro
Hi, I'm trying to use the MLPClassifier from hep_ml.nnet. When running the code I get an error that suggests adding -fPIC during the Python compilation.
I tried adding the -fPIC option in the makefile and also changing CFLAGS=... to CFLAGS+=... as suggested in some other discussions, but both ways failed.
I attach in the zip both the error that I receive and the Makefile generated via ./configure. I build it with gcc48 on an slc6 x86_64 system with an lxplus-like configuration.
Issues.zip
Does anyone have an idea what I'm doing wrong?
This will probably resolve an issue of information being at several places at once
Create a separate branch and test with wget and raw.github.com
Hello,
How could I check the convergence of uBoost when using uniforming_rate (alpha) != 0? When I plot the log-loss metric vs number of boostings I see it increases, with a rate proportional to the alpha value used. You can see this trend in the attached plot. On the other hand, I can make the log-loss converge with another hyper-parameter configuration (for the same alpha), but then I don't get a uniform selection. How can I deal with this? Does it mean that the log-loss is not a good metric to check convergence in this case?
uboost_vs_adaboost.pdf
Thanks very much,
Gino
Currently many loss functions simply ignore this step.
Alternatively, we can add a 'find a constant' procedure to get the right constant.
This option is more general.
Hello,
I'm trying to run uBoost to get a bkg efficiency that is flat in mass. In particular, I want the efficiency to be flat at 8% bkg efficiency. To do this I used uBoostBDT and set 'target_efficiency'=0.08 and 'uniform_label': 0. I ran GridSearchCV to get the best hyper-parameters and trained with those for ~100 boostings and with different 'uniform_rate' values, e.g. [0,5,10,15,20].
Looking at the bkg efficiency vs mass plots I see that the "bkg.eff. = 92%" profile gets gradually flatter as 'uniform_rate' increases, which is exactly the behaviour I want, but for the wrong profile! This made me suspect that to get a flat 8% bkg.eff I need to set 'target_efficiency'=0.92.
Looking at the code, I see there is a sign flip in the classifier score:
https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L182
self.signed_uniform_label = 2 * self.uniform_label - 1
https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L243
signed_score = score * self.signed_uniform_label
So my interpretation of 'target_efficiency' is different for the two classes: when flattening signal, it is the amount of signal to keep; when flattening bkg, it is the amount of bkg to discard.
Is this reasoning correct?
Thanks in advance,
Gino
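One way to check this interpretation numerically is to reproduce the sign flip on toy scores. The sketch below assumes the cut is placed so that a fraction `target_efficiency` of the uniform-label events has a signed score above the threshold. The helper `raw_score_efficiency` and the toy scores are hypothetical, and this is not the uBoost code itself.

```python
import numpy as np

def raw_score_efficiency(scores, uniform_label, target_efficiency):
    """Fraction of events whose raw score lies above the cut implied by the signed score."""
    signed_uniform_label = 2 * uniform_label - 1      # +1 for label 1, -1 for label 0
    signed_score = scores * signed_uniform_label
    # Threshold such that `target_efficiency` of the events have signed_score above it.
    signed_cut = np.quantile(signed_score, 1.0 - target_efficiency)
    # Translate the threshold back to the raw (unsigned) score.
    raw_cut = signed_cut * signed_uniform_label
    return np.mean(scores > raw_cut)

rng = np.random.default_rng(0)
scores = rng.normal(size=100_000)   # toy scores for the class being flattened

eff_label_1 = raw_score_efficiency(scores, uniform_label=1, target_efficiency=0.08)
eff_label_0 = raw_score_efficiency(scores, uniform_label=0, target_efficiency=0.08)
print(eff_label_1, eff_label_0)
```

Under this assumption, the same `target_efficiency` corresponds to keeping about 8% of events above the raw-score cut for label 1, and about 92% for label 0, which matches the behaviour described in the question.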
Hello @arogozhnikov ,
I am using the GBReweighter. Before the weights are applied, I can assume that my dataset is described by
{x1, x2...}
after the weights are applied the dataset is:
{(xi, wi) : i in [1, n]}
i.e. it depends on the weights. Therefore if I had initially a function f (xi), now that function is f(xi, wi). The weights wi are dependent on the knowledge of the data (target) and simulation (original) distributions. However we have finite samples for these and the weights should be assigned an error so that we could estimate the propagated error in f(xi, wi). Is there a way to estimate the error in these wi weights? How are these errors correlated, because those correlations would be needed to estimate the propagated error on f(xi, wi).
Cheers.
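A generic way to attach uncertainties to such weights is the bootstrap: resample both finite samples with replacement, refit, and look at the spread of the predicted weights at fixed points. The sketch below uses a simple 1D histogram-ratio reweighter as a stand-in for the real one so that it runs with NumPy alone. `fit_histogram_reweighter`, the binning, and the toy samples are all hypothetical.

```python
import numpy as np

def fit_histogram_reweighter(original, target, edges):
    """Stand-in reweighter: per-bin ratio of normalized target and original histograms."""
    h_orig, _ = np.histogram(original, bins=edges)
    h_targ, _ = np.histogram(target, bins=edges)
    p_orig, p_targ = h_orig / h_orig.sum(), h_targ / h_targ.sum()
    # Bins without original events get weight 1 instead of a division by zero.
    ratio = np.divide(p_targ, p_orig, out=np.ones_like(p_targ), where=p_orig > 0)

    def predict_weights(x):
        idx = np.clip(np.digitize(x, edges) - 1, 0, len(edges) - 2)
        return ratio[idx]

    return predict_weights

rng = np.random.default_rng(1)
original = rng.normal(0.0, 1.0, size=5000)   # toy simulation
target = rng.normal(0.3, 1.0, size=5000)     # toy data
edges = np.linspace(-3, 3, 13)

eval_points = np.array([-1.0, 0.0, 1.0])
n_bootstrap = 200
replicas = np.empty((n_bootstrap, len(eval_points)))
for b in range(n_bootstrap):
    # Resample both samples with replacement and refit the reweighter.
    orig_b = rng.choice(original, size=len(original), replace=True)
    targ_b = rng.choice(target, size=len(target), replace=True)
    replicas[b] = fit_histogram_reweighter(orig_b, targ_b, edges)(eval_points)

weight_mean = replicas.mean(axis=0)
weight_std = replicas.std(axis=0, ddof=1)           # per-point uncertainty
weight_corr = np.corrcoef(replicas, rowvar=False)   # correlations between the weights
print(weight_mean, weight_std)
```

To propagate the uncertainty to a derived quantity f, one option is to evaluate f once per bootstrap replica and take the spread of the results, which accounts for the correlations between weights automatically.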
Hi I am wondering if you can help me.
I am getting the following error when trying to use UGradientBoost:
Traceback (most recent call last):
  File "uBoost_test.py", line 196, in <module>
    main()
  File "uBoost_test.py", line 30, in main
    train_classifier(dataframe, mode, year)
  File "uBoost_test.py", line 101, in train_classifier
    ugradientboost.fit(X_train, Y_train, w_train)
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/gradientboosting.py", line 205, in fit
    return UGradientBoostingBase.fit(self, X, y, sample_weight=sample_weight)
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/gradientboosting.py", line 131, in fit
    residual, weights = self.loss.prepare_tree_params(y_pred)
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/losses.py", line 118, in prepare_tree_params
    return self.negative_gradient(y_pred), numpy.ones(len(y_pred))
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/losses.py", line 753, in negative_gradient
    neg_gradient = self._compute_fl_derivatives(y_pred) * self.fl_coefficient
  File "/afs/cern.ch/user/m/mhilton/.local/lib/python3.6/site-packages/hep_ml/losses.py", line 748, in _compute_fl_derivatives
    assert numpy.all(neg_gradient[~numpy.in1d(self.y, self.uniform_label)] == 0)
AssertionError
I am wondering if you can explain this last assert line and why this is happening?
Many thanks.
Hi,
I am using this package to reweight MC to look like sPlotted data, and I would like to scan the hyper-parameters to look for the best configuration.
scikit-learn tools are available for this (e.g. GridSearchCV or RandomizedSearchCV), but I am having trouble interfacing the two packages.
Has anyone done that? Are there alternative ways within hep_ml?
In particular, I have my pandas DataFrame for the original and target samples and I am trying something like
GBreweighterPars = {"n_estimators" : [10,500],
                    "learning_rate" : [0.1, 1.0],
                    "max_depth" : [1,5],
                    "min_samples_leaf" : [100,5000],
                    "subsample" : [0.1, 1.0]}
reweighter = reweight.GBReweighter(n_estimators = GBreweighterPars["n_estimators"],
                                   learning_rate = GBreweighterPars["learning_rate"],
                                   max_depth = GBreweighterPars["max_depth"],
                                   min_samples_leaf = GBreweighterPars["min_samples_leaf"],
                                   gb_args = {"subsample" : GBreweighterPars["subsample"]})
gridSearch = GridSearchCV(reweighter, param_grid = GBreweighterPars)
fit = gridSearch.fit(original, target)
but I get the following error
  File "mlWeight.py", line 273, in <module>
    rw = misc.reWeighter(ana, clSamples, inSamples, cutCL + " && " + cutEvt[i], cutMC + " && " + cutEvt[i], weightCL, weightMC, varsToMatch, varsToWatch, year, trigger, name + "_" + str(i), inName, useSW, search, test, add)
  File "/disk/moose/lhcb/simone/RD/Analysis/RKst/ml/misc.py", line 464, in reWeighter
    fit = gridSearch.fit(original, target)
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 940, in fit
    return self._fit(X, y, groups, ParameterGrid(self.param_grid))
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 539, in _fit
    self.scorer_ = check_scoring(self.estimator, scoring=self.scoring)
  File "/disk/moose/lhcb/simone/software/anaconda/4.0.0/lib/python2.7/site-packages/sklearn/metrics/scorer.py", line 273, in check_scoring
    "have a 'score' method. The estimator %r does not." % estimator)
TypeError: If no scoring is specified, the estimator passed should have a 'score' method. The estimator GBReweighter(gb_args={'subsample': [0.1, 1.0]}, learning_rate=[0.1, 1.0],
       max_depth=[1, 5], min_samples_leaf=[100, 5000],
       n_estimators=[10, 500]) does not.
However, I am not sure how to set the score method for GBReweighter
Any help/suggestions/examples would be much appreciated
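As the error message says, GridSearchCV needs either a `score` method on the estimator or an explicit `scoring` argument. One alternative is a plain loop over parameter combinations with a hand-written figure of merit, evaluated on held-out events. The sketch below is only an illustration of that loop: it scores each configuration with a weighted Kolmogorov-Smirnov distance on a 1D toy, and it uses a histogram-ratio stand-in (`fit_stand_in`, with `n_bins` as its only hyper-parameter) where a real scan would fit the reweighter under study. All names in it are hypothetical.

```python
import itertools
import numpy as np

def weighted_ks(a, b, wa, wb):
    """Kolmogorov-Smirnov distance between two weighted 1D samples."""
    grid = np.sort(np.concatenate([a, b]))

    def cdf(x, w):
        order = np.argsort(x)
        cum = np.concatenate([[0.0], np.cumsum(w[order]) / np.sum(w)])
        return cum[np.searchsorted(x[order], grid, side="right")]

    return np.max(np.abs(cdf(a, wa) - cdf(b, wb)))

def fit_stand_in(original, target, n_bins):
    """Stand-in reweighter (histogram ratio); a real scan would fit the actual reweighter here."""
    edges = np.linspace(-4, 4, n_bins + 1)
    h_o, _ = np.histogram(original, bins=edges)
    h_t, _ = np.histogram(target, bins=edges)
    p_o, p_t = h_o / h_o.sum(), h_t / h_t.sum()
    ratio = np.divide(p_t, p_o, out=np.ones_like(p_t), where=p_o > 0)
    return lambda x: ratio[np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)]

rng = np.random.default_rng(2)
original, target = rng.normal(0, 1, 20000), rng.normal(0.5, 1.2, 20000)
# Hold out half of each sample so the score is not computed on the training events.
o_train, o_test = original[:10000], original[10000:]
t_train, t_test = target[:10000], target[10000:]

param_grid = {"n_bins": [2, 10, 40]}
results = []
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    predict = fit_stand_in(o_train, t_train, **params)
    score = weighted_ks(o_test, t_test, predict(o_test), np.ones(len(t_test)))
    results.append((score, params))

best_score, best_params = min(results, key=lambda r: r[0])
baseline = weighted_ks(o_test, t_test, np.ones(len(o_test)), np.ones(len(t_test)))
print(baseline, best_score, best_params)
```

Note that in the snippet from the question the lists of candidate values are passed directly to the estimator constructor, which is why they appear inside the estimator's repr in the error message. In a scan, each fit should receive a single value per parameter.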
mean absolute error (there will be some problems with predicting values in leaves), but still worth adding
Hi,
I am trying to use a uBoost BDT to achieve uniform signal efficiency. My base estimator is a Keras model (Tensorflow 2.2), which I have written as a scikit-learn BaseEstimator subclass using tensorflow.keras.wrappers.scikit_learn.KerasClassifier. The training and everything seems to work fine, but I am encountering an error when I try to save the uboost classifier with pickle/joblib. The error is TypeError: can't pickle _thread.RLock objects
(full error at bottom - it is mostly a long thread of calls to pickle )
From trying to look it up it seems the error is usually to do with the way tensorflow is run, but I'm only creating a simple model and fitting, and all the session handling should be taken care of in this version of tf/keras. Maybe this answer is related: keras-team/keras#8343 (comment),
i.e. perhaps there is a call to something from the model that leaves an unserializable tensor object? As I am using the BDT, not the classifier, I assume it is not to do with any parallel processes either?
Please let me know if you know what is causing the issue or if there is some way I can work around it.
Thanks!
Some optimizers of classifier's application
Sometimes one would like to use a control sample (e.g. because it is more abundant) to determine MC weights that are then applied to other, rarer samples.
For this reason it would be very useful if hep_ml.reweight could export the "reweighting formula" in some format, e.g. ROOT, so that it can also be reused from different programming languages.
Thanks
Importing hep_ml raises a deprecation warning for sklearn:
$ python -c 'from hep_ml import splot'
/Users/apearce/.virtualenvs/foobar/lib/python2.7/site-packages/sklearn/cross_validation.py:44:
DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection
module into which all the refactored classes and functions are moved. Also note that the interface
of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
For consistency with REP (and the whole ML world):
To demonstrate fighting with correlation
Hi,
I have been trying to add uBoost to my grid search in REP and have encountered some difficulties.
I have made a minimal example of the error I get:
`
df = pandas.DataFrame(np.random.randn(8, 4),columns=['A', 'B', 'C', 'D'])
df['E'] = 1
df['E'][3:] = 0
labels = df['E']
data = df.drop('E',axis=1)
uni_feats = 'C'
variables = ['A','B','D']
uboost_clf = uBoostClassifier(uniform_features=uni_feats, uniform_label=1,
train_features=variables)
grid_param = {}
grid_param['n_estimators'] = [50,100,125,150]
grid_param['n_neighbors'] = [50,51,52,53]
generator = RandomParameterOptimizer(grid_param,n_evaluations=2)
scorer = FoldingScorer(RocAuc(), folds=3, fold_checks=3)
estimator = SklearnClassifier(uboost_clf)
grid_finder = GridOptimalSearchCV(estimator, generator, scorer, parallel_profile='threads-4')
grid_finder.fit(data, labels)
`
This always results in the error:
Performing grid search in 4 threads
ERROR:rep.metaml.gridsearch:Fail during training on the node
Exception an integer is required
Parameters n_estimators=150, n_neighbors=52
ERROR:rep.metaml.gridsearch:Fail during training on the node
Exception an integer is required
Parameters n_estimators=125, n_neighbors=52
2 evaluations done
I have had a look but I've had no luck finding the source of the exception, and I'm a bit puzzled as to what is causing this. The same code works for a number of other classifiers.
Is this just a case of something which is not supported by uboost?
Any help or clarification here would be greatly appreciated,
Ryan
subj. This also shall be mentioned in documentation and howto notebook
Finally resolve these names and the strategy of passing 'neighbor features'.
Hello Alex,
I am using the GBReweighter class and I would like to know if there is a native way to persist an instance of this class so that it can be used to predict_weights afterwards. I cannot find anything native in the documentation and I was thinking that maybe something like pickle might work.
Cheers.
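The standard-library pickle round trip mentioned above looks as follows for any picklable Python object. The sketch uses a plain dictionary as a stand-in for a fitted object so that it runs without extra dependencies. The file name and the dictionary contents are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted reweighter: any picklable Python object follows the same pattern.
fitted_state = {"bin_edges": [0.0, 1.0, 2.0], "bin_weights": [0.8, 1.3]}

path = os.path.join(tempfile.mkdtemp(), "reweighter.pkl")

# Save after training.
with open(path, "wb") as f:
    pickle.dump(fitted_state, f)

# Load later, in the same or another session, and reuse.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == fitted_state)
```

When the pickled object is an instance of a library class, the loading environment needs that library importable, and ideally the same versions as at training time, since pickle stores a reference to the class rather than its code.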
It is worth adding a special notebook with explanations of sPlot (and probably some examples of fitting).
I have used hep_ml in the past weeks to reweight MC distributions and stumbled upon the following issue.
When determining weights as the data/MC ratio of normalised distributions, the computed weights are normalised such that Sum w_i = N.
However, I noticed this is not the case for weights obtained using hep_ml.reweight.
Is this expected or am I missing something?
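If the convention Sum w_i = N is needed downstream, the weights can be rescaled by a constant after the fact, which leaves the shape of every reweighted distribution unchanged. A minimal sketch with toy weights (the gamma-distributed values are only placeholders for weights returned by a reweighter):

```python
import numpy as np

rng = np.random.default_rng(3)
weights = rng.gamma(shape=2.0, scale=0.7, size=1000)   # toy weights; their sum differs from N

# Rescale so that the weights sum to the number of events (mean weight of 1).
normalized = weights * len(weights) / weights.sum()

print(weights.sum(), normalized.sum())
```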
Add support of noise and dropouts.
Need to add sWeights as default behavior. And set original weights to ones.
Either put uniform_label=0 everywhere or leave without default value.
Hi,
I am using your tool for reweighting MC to make it look like sPlotted data (thanks!)
I am currently following the examples and I am able to make it work for 1D, but I am not sure how to do it for more than one dimension. I mean, what I need is to find the MC weights when more than one variable presents disagreement between MC and data.
How do I do it?
Thanks!
Vicente Rives Molina
Probably some test notebook would be helpful to see how changes in losses affect training process.
Hi,
I'm using hep_ml to perform a multidimensional reweighting of a MC sample and it is working really well. I have a question, to which I've not been able to find an answer in the paper (https://arxiv.org/pdf/1608.05806.pdf) or in the documentation (https://arogozhnikov.github.io/hep_ml/reweight.html).
The multidimensional space is split in large bins by optimising a symmetrised chi2. But are those bins multidimensional or 1-dimensional ? In other words, is hep_ml reweighting 1D distributions iteratively ? Or is it reweighting multidimensional distributions directly ?
Cheers,
Maxime
Hi, I'm trying to correct the distribution D in an original (MC) sample that already has some weights, say w_i, that correct something else (say Dp). The way I'm currently doing this is I obtain weights, say x_i, by calling predict_weights(original = D_array, original_weight = w).
My question is the following: once I've done this, do I have to use x_i or w_i * x_i as nominal weights for my MC (i.e. to have both D and Dp corrected)? If the answer is x_i, then very naively one could assume that the ratio of the two sets of corrections (x_i, w_i) would yield something that corrects Dp but not D. Is this assumption correct?
Cheers,
Dan
Hi,
I've noticed some issues with very large weights using GBReweighter. I am trying to reweight data to look like some toy data I have. I have signal sWeights for the data but not for the toy data.
I've trained the BDT using only the original_weight but not target_weight arguments. Reading previous issues I've tried to ensure that there is overlap in my data and toy distributions. I'm using 800k toy and 500k data events which I think should be enough.
Do I need sWeights information for both datasets?
Without the sWeights the reweighting is much better, however how should I account for background if I don't use sWeights?
Thanks
Hey,
What do you think of a new release? The last release was over a year ago and some nice things have been added since (like the loss_regularization parameter etc.).
Moreover, it is very inconvenient to specify the GitHub version as a pip dependency (most of all as an optional dependency), so a new pip version would be really great.
I saw your presentation on “Reweighting distributions with gradient boosting” and it looked great, so gave it go. But now I want to explain it to others, so have to actually understand how it works 😄
One thing I'm not certain on is how the node splitting is determined, i.e. how the value of the split of the training data along a feature axis at a node is determined. You say this is the “symmetrized binned chi^2”, and I'd like to check that my understanding of what that is is correct. I have a notebook to try to reproduce your plot. It looks similar, but I might have done something wrong nevertheless. Does it look sensible?
I tried to find where this computation is done in the code, but I couldn't find it. I'm not at all familiar with the general scikit-learn code architecture, so it's just that I have trouble following the flow of all the Xs and ys. Could you point me to where the chi^2 computation is done?
(And, of course, thanks for the excellent package! 🍻)
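For readers who want to experiment with the split criterion, here is a small NumPy sketch. It assumes the symmetrized chi^2 has the form sum over bins of (w_original - w_target)^2 / (w_original + w_target), and that a candidate split along one feature defines two bins. The helper names and the toy samples are hypothetical, and the sketch does not show where or how hep_ml itself performs the computation.

```python
import numpy as np

def symmetrized_chi2(w_original, w_target):
    """Sum over bins of (w_o - w_t)^2 / (w_o + w_t); empty bins contribute zero."""
    w_original = np.asarray(w_original, dtype=float)
    w_target = np.asarray(w_target, dtype=float)
    total = w_original + w_target
    diff2 = (w_original - w_target) ** 2
    return np.sum(np.divide(diff2, total, out=np.zeros_like(total), where=total > 0))

def best_split(original, target, candidates):
    """Scan thresholds on one feature; each threshold defines a left and a right bin."""
    scores = []
    for t in candidates:
        w_o = [np.mean(original < t), np.mean(original >= t)]   # normalized weight per bin
        w_t = [np.mean(target < t), np.mean(target >= t)]
        scores.append(symmetrized_chi2(w_o, w_t))
    scores = np.array(scores)
    return candidates[np.argmax(scores)], scores

rng = np.random.default_rng(4)
original = rng.normal(0.0, 1.0, 20000)
target = rng.normal(1.0, 1.0, 20000)

threshold, scores = best_split(original, target, np.linspace(-2, 3, 51))
print(threshold)
```

The split that maximizes this quantity separates the regions where the two samples disagree most, which is where a reweighting correction is most needed.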
I am trying to use GBReweighter and am getting odd behaviour of the weights as seen in the figures attached. My parameters are as follows:
reweighter = GBReweighter(n_estimators=40, learning_rate=0.1, max_depth=3, min_samples_leaf=1000, gb_args={'subsample': 0.4})
I have tried varying the parameters.
Do you know what might cause this behaviour?
Many thanks.
D02KSPiPiDD_2012_original.pdf
D02KSPiPiDD_2012_reweighted.pdf
I have tried to use speedup.LookupClassifier with XGBoost as a base_estimator but I failed. This may be a bug in the LookupClassifier implementation.
I executed the following Python code:
train_X, test_X, train_Y, test_Y = train_test_split(new_features, target, random_state=42,train_size=0.5 )
base_classifier = xgb.XGBClassifier(n_estimators=400, learning_rate=0.07 ,scale_pos_weight=ratio_ghost_to_good)
classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
classifier.fit(train_X, train_Y)
and obtained the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-afeadcdc082e> in <module>()
3 base_classifier = xgb.XGBClassifier(n_estimators=400, learning_rate=0.07 ,scale_pos_weight=ratio_ghost_to_good)
4 classifier = LookupClassifier(base_estimator=base_classifier, keep_trained_estimator=False)
----> 5 classifier.fit(train_X, train_Y)
/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/hep_ml/speedup.pyc in fit(self, X, y, sample_weight)
91 all_lookup_indices = numpy.arange(int(n_parameter_combinations))
92 all_combinations = self.convert_lookup_index_to_bins(all_lookup_indices)
---> 93 self._lookup_table = trained_estimator.predict_proba(all_combinations)
94
95 if self.keep_trained_estimator:
/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/sklearn.pyc in predict_proba(self, data, output_margin, ntree_limit)
475 class_probs = self.booster().predict(test_dmatrix,
476 output_margin=output_margin,
--> 477 ntree_limit=ntree_limit)
478 if self.objective == "multi:softprob":
479 return class_probs
/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/core.pyc in predict(self, data, output_margin, ntree_limit, pred_leaf)
937 option_mask |= 0x02
938
--> 939 self._validate_features(data)
940
941 length = ctypes.c_ulong()
/afs/cern.ch/user/a/adendek/.local/lib/python2.7/site-packages/xgboost/core.pyc in _validate_features(self, data)
1177
1178 raise ValueError(msg.format(self.feature_names,
-> 1179 data.feature_names))
1180
1181 def get_split_value_histogram(self, feature, fmap='', bins=None, as_pandas=True):
ValueError: feature_names mismatch: [u'seed_chi2PerDoF', u'seed_p', u'seed_pt', u'seed_nLHCbIDs', u'seed_nbIT', u'seed_nLayers', u'seed_x', u'seed_y', u'seed_tx', u'seed_ty', u'abs_seed_x', u'abs_seed_y', u'abs_seed_tx', u'abs_seed_ty', u'seed_r', u'pseudo_rapidity'] ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15']
expected seed_nbIT, abs_seed_y, abs_seed_x, seed_tx, seed_pt, seed_nLayers, seed_x, seed_y, seed_ty, pseudo_rapidity, seed_p, seed_r, abs_seed_tx, abs_seed_ty, seed_nLHCbIDs, seed_chi2PerDoF in input data
training data did not have the following fields: f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f12, f13, f10, f11, f14, f15
I am currently trying to use hep_ml splot and am getting some sWeights as nan. I am using a sample of ~1.4M events and this seems to be happening after event ~200k. I have checked the signal and background probabilities and these look reasonable. I have also checked the sWeights before event ~200k and these also look reasonable. I have checked the sWeighted signal and background distributions for a relevant pT variable and these also look ok.
So I am wondering is there some reason they will not be calculated correctly after a certain event? Any help would be much appreciated.
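A first diagnostic for this kind of problem is to locate the non-finite entries and check whether they already appear in the inputs. The sketch below is generic NumPy, with toy arrays standing in for the per-event probabilities and the resulting sWeights. `report_non_finite` is a hypothetical helper.

```python
import numpy as np

def report_non_finite(name, values):
    """Return indices of NaN or infinite entries and print a short summary."""
    values = np.asarray(values, dtype=float)
    bad = np.flatnonzero(~np.isfinite(values))
    first = bad[0] if bad.size else None
    print(f"{name}: {bad.size} non-finite entries, first at index {first}")
    return bad

# Toy arrays standing in for the signal probability and the resulting sWeights.
sig_prob = np.array([0.9, 0.2, np.nan, 0.5])
sweights = np.array([1.1, -0.3, np.nan, 0.4])

bad_inputs = report_non_finite("signal probability", sig_prob)
bad_outputs = report_non_finite("sWeights", sweights)

# True if the non-finite outputs occur at exactly the same events as the non-finite inputs.
print(np.array_equal(bad_inputs, bad_outputs))
```

If the two index sets coincide, the probabilities fed into the sWeight computation are the place to look. If the inputs are all finite, the problem lies in the computation itself.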
(Leaving this as an open answer to common question)
Why do GBReweighter/UGradientBoostingClassifier provide different weights after each training?
Both algorithms are based on stochastic tree boosting. Settings like subsample and max_features lead to randomized tree building (i.e. each tree uses only a random part of the training data), which is widely known to strengthen the ensemble by building more diverse trees.
hep_ml follows the sklearn convention of keeping random things random unless explicitly asked otherwise.
Reproducible behavior is achieved by setting random_state.
For boosting:
UGradientBoostingClassifier(<other setting here>, random_state=42)
For the reweighter:
GBReweighter(<other setting here>, gb_args={'random_state': 42, <other gb args>})