
Comments (7)

arogozhnikov commented on July 20, 2024

Hi Tommy,
maybe I misunderstand what you're doing, but it looks like you are looking at the individual predictions of uBoostBDTs, each targeting a specific efficiency. If that's true, I'm a bit surprised that the result is so smooth... Part of the uBoost design is to run multiple BDTs with different target efficiencies to smooth out the intrinsic variability of this process.

When estimating the properties of the final classifier, I would not recommend looking at individual components. Just as a BDT's behavior is poorly explained by its individual trees, uBoost's behavior is poorly explained by the individual efficiency-targeted BDTs inside it.

Instead, analyze the predictions of uBoost as a whole: sweep over thresholds and plot signal efficiency and background efficiency. I expect your plots to be smoother (at the very least, both should be monotonically increasing as the threshold is lowered).
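A minimal sketch of such a sweep (my own illustration with toy scores, not hep_ml code):

```python
# Sweep decision thresholds over classifier scores and compute the
# signal and background efficiency at each one. `scores` would be
# uBoost's predict_proba output for the signal class; here random
# numbers stand in.
import numpy as np

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)            # 1 = signal, 0 = background
scores = rng.normal(loc=y * 0.5, scale=1.0)  # toy classifier output

thresholds = np.linspace(scores.min(), scores.max(), 50)
sig_eff = [(scores[y == 1] > t).mean() for t in thresholds]
bkg_eff = [(scores[y == 0] > t).mean() for t in thresholds]

# Both efficiency curves fall monotonically as the threshold rises,
# i.e. they rise monotonically as the threshold is lowered.
assert all(a >= b for a, b in zip(sig_eff, sig_eff[1:]))
assert all(a >= b for a, b in zip(bkg_eff, bkg_eff[1:]))
```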

Cheers

from hep_ml.

TommyDESY commented on July 20, 2024

Hi Alex,

Thank you for your answer.

I understand the conceptual workflow of uBoost better now. However, I'm struggling on the technical side. I guess you mean to use the uBoostClassifier class then, and not uBoostBDT individually? I can make the latter work but not the former. I reckon uBoostClassifier runs a given number of uBoostBDTs with different target efficiencies, correct? With uBoostClassifier, all the events in my dataset are always classified as signal, no matter how I tune the parameters.

I'm obviously doing something wrong but I can't quite understand what :/


arogozhnikov commented on July 20, 2024

Yeah, the idea is that you use the full model (that is, uBoostClassifier: an ensemble of ensembles). It is notoriously slow, but that's how it was designed.

all the events in my dataset are always classified as signal

That's surprising. What about the area under the ROC curve? It may be that all predictions are shifted (e.g., all probabilities are > 0.5), but the resulting classifier still has acceptable properties in terms of signal vs. background separation and flatness.
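The point that a shifted output range leaves the separation power intact is easy to demonstrate with toy numbers (my own illustration, using sklearn's roc_auc_score):

```python
# ROC AUC depends only on the ordering of scores, so compressing all
# predictions into (0.5, 0.7) with a monotone map leaves the signal
# vs. background separation unchanged.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=500)
raw = rng.normal(loc=y, scale=1.0)           # toy raw scores

squashed = 0.5 + 0.2 / (1.0 + np.exp(-raw))  # everything in (0.5, 0.7)
assert squashed.min() > 0.5 and squashed.max() < 0.7
# AUC is identical because the monotone map preserves ordering:
assert roc_auc_score(y, raw) == roc_auc_score(y, squashed)
```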


TommyDESY commented on July 20, 2024

All the probabilities for the signal are between 0.5 and 0.7. I get 100% signal and background efficiency in this case. I looked at the probabilities at every stage (and therefore at different target efficiencies) with staged_predict_proba(), and the probabilities always look the same. The ROC curve and flatness don't make sense here.

[image: proba_uboostclassifier]

I noticed this behaviour quite some time ago, which is why I stopped using the full classifier. uBoostBDT works, as you can see from the plots I showed in my first post.
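For reference, this kind of stage-by-stage inspection can be sketched with sklearn's AdaBoostClassifier as a stand-in (hep_ml's uBoost classes expose an analogous staged_predict_proba; names and data here are my own toy example):

```python
# Inspect the per-stage probability arrays of a boosted ensemble.
# If these barely change from stage to stage, the ensemble is not
# learning anything new as boosting proceeds.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)
clf = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)

# One probability array per boosting stage:
stage_probas = list(clf.staged_predict_proba(X))
assert len(stage_probas) == 10
assert stage_probas[0].shape == (300, 2)
```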


arogozhnikov commented on July 20, 2024

I see, the reason is this squashing function:
https://github.com/arogozhnikov/hep_ml/blob/master/hep_ml/uboost.py#L540
It's not necessary, and you can remove it if you don't like the range (just return score / self.efficiency_steps).
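A sketch of what that squashing changes (my assumption: the linked line applies a sigmoid to the averaged per-efficiency score; check uboost.py for the exact form):

```python
# Compare a sigmoid-squashed output with the plain averaged score.
# `score` stands in for the sum of the efficiency-targeted BDTs'
# outputs; `efficiency_steps` is the number of such BDTs.
import numpy as np

def squashed_output(score, efficiency_steps):
    # current behaviour (assumed): sigmoid compresses the range around 0.5
    return 1.0 / (1.0 + np.exp(-score / efficiency_steps))

def raw_output(score, efficiency_steps):
    # suggested change: just return the averaged score
    return score / efficiency_steps

scores = np.linspace(-5, 5, 11)
steps = 20
# Both preserve ordering, but the sigmoid maps everything into a
# narrow band around 0.5 while the raw average keeps the full spread.
assert np.all(np.diff(squashed_output(scores, steps)) > 0)
assert np.ptp(raw_output(scores, steps)) > np.ptp(squashed_output(scores, steps))
```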

An important comment, though: don't interpret uBoost outputs as probabilities (I know the function's name says so, but in reality you'd need additional calibration steps to turn them into probabilities). The proper way to think about the output is as a new discriminating variable that is more useful than the existing ones.

(So it's not ideal that uBoost returns output in such a narrow range, but it's not a problem either: users shouldn't expect it to behave like probabilities, and should select thresholds according to their needs.)
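Selecting a threshold for a target signal efficiency, for example, works the same regardless of the output range (my own sketch with toy scores):

```python
# Instead of a fixed 0.5 cut, pick the threshold that gives a desired
# signal efficiency, whatever range the classifier outputs occupy.
import numpy as np

rng = np.random.default_rng(2)
sig_scores = 0.5 + 0.2 * rng.random(10000)  # e.g. outputs stuck in (0.5, 0.7)

target_eff = 0.90
# The cut below which 10% of signal falls keeps 90% of it:
threshold = np.quantile(sig_scores, 1.0 - target_eff)

achieved = (sig_scores > threshold).mean()
assert abs(achieved - target_eff) < 0.01
```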

As for the .predict method that is part of the sklearn interface: there are practically no cases in HEP where you should use it. Better to just forget it exists :)


TommyDESY commented on July 20, 2024

Thank you for all your answers! I managed to make things work now.

I still have one more question. I cannot see the parameters learning_rate and uniforming_rate in the uBoostClassifier class, but they do exist in uBoostBDT. As they could be particularly important, I'm wondering why they are not included. Is there a particular reason? I couldn't find an answer in the documentation.


arogozhnikov commented on July 20, 2024

As they could be particularly important, I'm wondering why they are not included. Is there any particular reason ?

Not really; they could be exposed.

It's just that at the time, the idea was to follow the original uBoost paper (which modifies 'vanilla' AdaBoost and does not have a learning rate as a parameter). From a practical perspective, I think a learning rate would be very helpful.
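A generic toy illustration of what a learning rate does in boosting (my own sketch, not hep_ml or AdaBoost specifics):

```python
# A learning rate shrinks each boosting stage's contribution, so more
# stages are needed, but each step is smaller and training is more
# gradual. Here each "stage" trivially fits the remaining residual.

target = 2.5  # stand-in for the quantity being fit

def boosted_estimate(learning_rate, n_stages):
    estimate = 0.0
    for _ in range(n_stages):
        residual = target - estimate          # what is still unexplained
        estimate += learning_rate * residual  # shrunken stage contribution
    return estimate

# With learning_rate=1.0 one stage already suffices; with 0.1 the same
# number of stages leaves a larger residual.
assert abs(boosted_estimate(1.0, 5) - target) < 1e-12
assert abs(boosted_estimate(0.1, 5) - target) > 0.5
```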

