
Comments (7)

arogozhnikov commented on July 20, 2024

Hi Tommy,
maybe I misunderstand what you're doing, but it looks like you are looking at the individual predictions of uBoostBDTs, each targeting a specific efficiency. If that's true, I'm a bit surprised that the result is so smooth... Part of the uBoost design is to run multiple BDTs with different target efficiencies to smooth out the intrinsic variability of this process.

When estimating the properties of the final classifier, I would not recommend looking at individual components. Just as a BDT's behavior is poorly explained by its individual trees, uBoost's behavior is poorly explained by the individual efficiency-targeted BDTs inside it.

Instead, analyze the predictions of uBoost as a whole: sweep over thresholds and plot signal efficiency and background efficiency. I expect your plots to be smoother (at the very least, both should be monotonically increasing as the threshold is lowered).
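A minimal sketch of such a sweep (my own illustration with toy scores, not hep_ml code):

```python
# Sweep decision thresholds over classifier scores and compute the
# signal and background efficiency at each one. `scores` would be
# uBoost's predict_proba output for the signal class; here random
# numbers stand in.
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)            # 1 = signal, 0 = background
scores = rng.normal(loc=y * 0.5, scale=1.0)  # toy classifier output

thresholds = np.linspace(scores.min(), scores.max(), 50)
sig_eff = [(scores[y == 1] > t).mean() for t in thresholds]
bkg_eff = [(scores[y == 0] > t).mean() for t in thresholds]

# Both efficiency curves fall monotonically as the threshold rises,
# i.e. they rise monotonically as the threshold is lowered.
assert all(a >= b for a, b in zip(sig_eff, sig_eff[1:]))
assert all(a >= b for a, b in zip(bkg_eff, bkg_eff[1:]))
```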

Cheers

from hep_ml.

TommyDESY commented on July 20, 2024

Hi Alex,

Thank you for your answer.

I understand the conceptual workflow of uBoost better now. However, I'm struggling on the technical side. I guess you mean to use the uBoostClassifier class then, and not uBoostBDT individually? I can make the latter work but not the former. I reckon uBoostClassifier runs a given number of uBoostBDTs with different target efficiencies, correct? With uBoostClassifier, all the events in my dataset are always classified as signal, no matter how I tune the parameters.

I'm obviously doing something wrong but I can't quite understand what :/


arogozhnikov commented on July 20, 2024

Yeah, the idea is that you use the full model (that is, uBoostClassifier: an ensemble of ensembles). It is notoriously slow, but that's how it was designed.

all the events in my dataset are always classified as signal

That's surprising. What about the area under the ROC curve? It may be that all predictions are shifted (e.g., all probabilities are > 0.5), but the resulting classifier still has acceptable properties in terms of signal vs. background separation and flatness.
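The point that a shifted output range leaves the separation power intact is easy to demonstrate with toy numbers (my own illustration, using sklearn's roc_auc_score):

```python
# ROC AUC depends only on the ordering of scores, so compressing all
# predictions into (0.5, 0.7) with a monotone map leaves the signal
# vs. background separation unchanged.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
raw = rng.normal(loc=y, scale=1.0)           # toy raw scores

squashed = 0.5 + 0.2 / (1.0 + np.exp(-raw))  # everything in (0.5, 0.7)
assert squashed.min() > 0.5 and squashed.max() < 0.7
# AUC is identical because the monotone map preserves ordering:
assert roc_auc_score(y, raw) == roc_auc_score(y, squashed)
```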


TommyDESY commented on July 20, 2024

All the probabilities for the signal are between 0.5 and 0.7. I get 100% signal and background efficiency in this case. I looked at the probabilities at every stage (and therefore at different target efficiencies) with staged_predict_proba(), and the probabilities always look the same. The ROC curve and flatness don't make sense here.

[image: proba_uboostclassifier]

I noticed this behaviour quite some time ago, which is why I stopped using the full classifier. uBoostBDT works, as you can see from the plots I showed in my first post.
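For reference, this kind of stage-by-stage inspection can be sketched with sklearn's AdaBoostClassifier as a stand-in (hep_ml's uBoost classes expose an analogous staged_predict_proba; names and data here are my own toy example):

```python
# Inspect the per-stage probability arrays of a boosted ensemble.
# If these barely change from stage to stage, the ensemble is not
# learning anything new as boosting proceeds.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)
clf = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)

# One probability array per boosting stage:
stage_probas = list(clf.staged_predict_proba(X))
assert len(stage_probas) == 10
assert stage_probas[0].shape == (300, 2)
```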


arogozhnikov commented on July 20, 2024

I see, the reason is this squashing function:
https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L540
It's not necessary, and you can remove it if you don't like the range (just return score / self.efficiency_steps).
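A sketch of what that squashing changes (my assumption: the linked line applies a sigmoid to the averaged per-efficiency score; check uboost.py for the exact form):

```python
# Compare a sigmoid-squashed output with the plain averaged score.
# `score` stands in for the sum of the efficiency-targeted BDTs'
# outputs; `efficiency_steps` is the number of such BDTs.
import numpy as np

def squashed_output(score, efficiency_steps):
    # current behaviour (assumed): sigmoid compresses the range around 0.5
    return 1.0 / (1.0 + np.exp(-score / efficiency_steps))

def raw_output(score, efficiency_steps):
    # suggested change: just return the averaged score
    return score / efficiency_steps

scores = np.linspace(-5, 5, 11)
steps = 20
# Both preserve ordering, but the sigmoid maps everything into a
# narrow band around 0.5 while the raw average keeps the full spread.
assert np.all(np.diff(squashed_output(scores, steps)) > 0)
assert np.ptp(raw_output(scores, steps)) > np.ptp(squashed_output(scores, steps))
```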

An important comment, though: don't interpret uBoost outputs as probabilities (I know the function's name says so, but in reality you'd need additional calibration steps to turn them into probabilities). The proper way to think about the output is as a new discriminating variable that is more useful than the existing ones.

(So it's not ideal that uBoost returns output in such a narrow range, but it's not a problem either: users shouldn't expect it to behave like probabilities, and should select thresholds according to their needs.)
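Selecting a threshold for a target signal efficiency, for example, works the same regardless of the output range (my own sketch with toy scores):

```python
# Instead of a fixed 0.5 cut, pick the threshold that gives a desired
# signal efficiency, whatever range the classifier outputs occupy.
import numpy as np

rng = np.random.default_rng(2)
sig_scores = 0.5 + 0.2 * rng.random(10000)  # e.g. outputs stuck in (0.5, 0.7)

target_eff = 0.90
# The cut below which 10% of signal falls keeps 90% of it:
threshold = np.quantile(sig_scores, 1.0 - target_eff)

achieved = (sig_scores > threshold).mean()
assert abs(achieved - target_eff) < 0.01
```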

As for the .predict method that is part of the sklearn interface: there are practically no cases in HEP where you should use it. Better to just forget it exists :)


TommyDESY commented on July 20, 2024

Thank you for all your answers! I managed to make things work now.

I still have one more question. I cannot see the parameters learning_rate and uniforming_rate in the uBoostClassifier class, but they do exist in uBoostBDT. As they could be particularly important, I'm wondering why they are not included. Is there a particular reason? I couldn't find an answer in the documentation.


arogozhnikov commented on July 20, 2024

As they could be particularly important, I'm wondering why they are not included. Is there any particular reason ?

Not really; they could be exposed.

It's just that at the time, the idea was to follow the original uBoost paper (which modifies 'vanilla' AdaBoost and does not have a learning rate as a parameter). From a practical perspective, I think a learning rate would be very helpful.
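A generic toy illustration of what a learning rate does in boosting (my own sketch, not hep_ml or AdaBoost specifics):

```python
# A learning rate shrinks each boosting stage's contribution, so more
# stages are needed, but each step is smaller and training is more
# gradual. Here each "stage" trivially fits the remaining residual.

target = 2.5  # stand-in for the quantity being fit

def boosted_estimate(learning_rate, n_stages):
    estimate = 0.0
    for _ in range(n_stages):
        residual = target - estimate          # what is still unexplained
        estimate += learning_rate * residual  # shrunken stage contribution
    return estimate

# With learning_rate=1.0 one stage already suffices; with 0.1 the same
# number of stages leaves a larger residual.
assert abs(boosted_estimate(1.0, 5) - target) < 1e-12
assert abs(boosted_estimate(0.1, 5) - target) > 0.5
```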

