Comments (7)
Hi Tommy,
Maybe I misunderstand what you are doing, but it looks like you are looking at the individual predictions of the uBoostBDTs, each targeting a specific efficiency. If that's true, I'm a bit surprised the result is so smooth... Part of the uBoost idea is to run multiple BDTs with different target efficiencies to smooth out the intrinsic variability of this process.
When estimating the properties of the final classifier, I would not recommend looking at individual components. A BDT's behaviour is poorly explained by its individual trees, and similarly uBoost's behaviour is poorly explained by the individual efficiency-targeted BDTs inside it.
Instead, analyze the predictions of uBoost as a whole: sweep over thresholds and plot the signal efficiency and background efficiency. I expect your plots to be smoother (at the very least, both curves should be monotonic in the threshold).
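The threshold sweep described above can be sketched as follows. This is a minimal illustration with synthetic labels and scores standing in for real uBoost outputs; the arrays, seed, and signal shift are all made up:

```python
import numpy as np

# Synthetic stand-ins: labels and classifier scores (signal scores shifted up).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
scores = rng.normal(loc=y.astype(float))  # pretend clf.predict_proba(X)[:, 1]

thresholds = np.linspace(scores.min(), scores.max(), 50)
sig_eff = np.array([(scores[y == 1] > t).mean() for t in thresholds])
bkg_eff = np.array([(scores[y == 0] > t).mean() for t in thresholds])

# Tightening the cut can only remove events, so each efficiency curve is
# monotonically non-increasing in the threshold; non-monotonic behaviour
# would point at a bug in the evaluation rather than in the classifier.
assert np.all(np.diff(sig_eff) <= 0)
assert np.all(np.diff(bkg_eff) <= 0)
```

Plotting `bkg_eff` against `sig_eff` over the sweep then gives the usual ROC-style curve for the full ensemble.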
Cheers
from hep_ml.
Hi Alex,
Thank you for your answer.
I understand the conceptual workflow of uBoost better now. However, I struggle on the technical side. I guess you mean I should use the uBoostClassifier class, then, and not uBoostBDT individually? I can make the latter work but not the former. I reckon uBoostClassifier runs a given number of uBoostBDTs with different target efficiencies, correct? With uBoostClassifier, all the events in my dataset are always classified as signal, no matter how I tune the parameters.
I'm obviously doing something wrong, but I can't quite understand what :/
Yeah, the idea is that you use the full model (that is, uBoost, an ensemble of ensembles). It is notoriously slow, but that's how it was designed.
all the events in my dataset are always classified as signal
That's surprising. What about the area under the ROC curve? It may be that all predictions are shifted (e.g. all probabilities are > 0.5), but the resulting classifier still has acceptable properties in terms of signal vs background separation and flatness.
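A quick way to see why shifted outputs can still separate well: squash synthetic scores into a narrow band above 0.5 with a monotone map (the labels, shift, and squashing formula below are all made up for illustration). A naive 0.5 cut then calls everything signal, but the ranking is untouched, so a cut placed inside the band still separates:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)
raw = rng.normal(loc=3.0 * y)                # hypothetical discriminant
squashed = 0.5 + 0.2 / (1.0 + np.exp(-raw))  # monotone map into (0.5, 0.7)

# A naive 0.5 threshold (what .predict uses) labels every event as signal...
assert np.all(squashed > 0.5)

# ...but the ordering of events is intact, so a cut inside the narrow
# band still gives good separation.
t = np.median(squashed)
sig_eff = (squashed[y == 1] > t).mean()
bkg_eff = (squashed[y == 0] > t).mean()
assert sig_eff > 0.85 and bkg_eff < 0.15
```

Since ROC AUC depends only on the ordering of scores, it is identical for `raw` and `squashed`.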
All the probabilities for the signal are between 0.5 and 0.7, so I get 100% signal and background efficiency in this case. I looked at the probabilities at every stage (and therefore at different target efficiencies) with staged_predict_proba(), and they always look the same. The ROC curve and flatness don't make sense here.
I noticed this behaviour quite some time ago, which is why I stopped using the full classifier. uBoostBDT works, as you can see from the plots I showed in my first post.
I see, the reason is this squashing function:
https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L540
It's not necessary, and you can remove it if you don't like the range (just return score / self.efficiency_steps).
An important comment, though: don't interpret uBoost outputs as probabilities (I know the name of the function says so, but in reality you would need additional calibration steps to turn them into probabilities). The proper way to think about the outputs is as a new discriminating variable that is more useful than the existing ones.
(So it's not great that uBoost returns output in such a narrow range, but it's not a problem either: users shouldn't expect it to behave like probabilities, and should select thresholds according to their needs.)
As for the .predict method that is part of the sklearn interface: there are practically no cases in HEP where you should use it. Better to just forget about its existence :)
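To illustrate the effect being discussed (this is a sketch, not the library's exact code; the sigmoid form and efficiency_steps = 20 are assumptions): passing the vote count of the efficiency-targeted BDTs through a sigmoid compresses non-negative scores into a band just above 0.5, which matches the (0.5, 0.7) range observed, while the suggested plain average spans the full [0, 1] range:

```python
import numpy as np

efficiency_steps = 20                       # assumed number of target efficiencies
votes = np.arange(0, efficiency_steps + 1)  # how many BDTs vote "signal"

# Sigmoid squashing (a sketch of the effect, not the library's exact formula):
squashed = 1.0 / (1.0 + np.exp(-votes / efficiency_steps))
# The suggested replacement: a plain average of the votes.
averaged = votes / efficiency_steps

assert squashed.min() == 0.5 and squashed.max() < 0.75  # narrow band above 0.5
assert averaged.min() == 0.0 and averaged.max() == 1.0  # full [0, 1] range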
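To illustrate the effect being discussed (this is a sketch, not the library's exact code; the sigmoid form and efficiency_steps = 20 are assumptions): passing the vote count of the efficiency-targeted BDTs through a sigmoid compresses non-negative scores into a band just above 0.5, which matches the (0.5, 0.7) range observed, while the suggested plain average spans the full [0, 1] range:

```python
import numpy as np

efficiency_steps = 20                       # assumed number of target efficiencies
votes = np.arange(0, efficiency_steps + 1)  # how many BDTs vote "signal"

# Sigmoid squashing (a sketch of the effect, not the library's exact formula):
squashed = 1.0 / (1.0 + np.exp(-votes / efficiency_steps))
# The suggested replacement: a plain average of the votes.
averaged = votes / efficiency_steps

assert squashed.min() == 0.5 and squashed.max() < 0.75  # narrow band above 0.5
assert averaged.min() == 0.0 and averaged.max() == 1.0  # full [0, 1] range
```

Either way, the scale of the output only matters for choosing a threshold, not for the separation it provides.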
Thank you for all your answers! I managed to make things work now.
I still have one more question: I cannot see the parameters learning_rate and uniforming_rate in the uBoostClassifier class, but they do exist in uBoostBDT. As they could be particularly important, I'm wondering why they are not included. Is there a particular reason? I couldn't find an answer in the documentation.
As they could be particularly important, I'm wondering why they are not included. Is there any particular reason ?
Not really, they could be exposed.
At the time, the idea was just to follow the original uBoost paper (and in the original paper there is a modification of 'vanilla' AdaBoost which does not have a learning rate as a parameter). From a practical perspective, I think a learning rate would be very helpful.
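For intuition about what exposing a learning rate would buy, here is a generic AdaBoost-style sketch (this mirrors the classic AdaBoost weight formula with a hypothetical learning-rate factor, not hep_ml's internals):

```python
import numpy as np

def estimator_weight(err, learning_rate=1.0):
    # Classic AdaBoost estimator weight, scaled by a hypothetical
    # learning_rate factor (not hep_ml's actual internal code).
    return learning_rate * 0.5 * np.log((1.0 - err) / err)

# Halving the learning rate halves every estimator's contribution, so the
# ensemble takes more, weaker steps -- the usual shrinkage trade-off.
assert np.isclose(estimator_weight(0.2, 1.0), 2.0 * estimator_weight(0.2, 0.5))
```

With shrinkage, one typically compensates by raising the number of estimators.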