
Comments (10)

Balandat commented on April 28, 2024

I can get very excited.

When it comes to Bayesian Optimization, we all do :)


andrejmuhic commented on April 28, 2024

Hi, @Balandat

In general, yes, though you have to be careful to make sure that the transform is consistent for all the data. Like, you can't just find the "best normalizing transform" for data of trials 1-10, pass that transformed data to Ax, and then compute another "best normalizing transform" based on data of trials 1-20 but only pass the transformed outcomes of trials 11-20 to Ax.

Actually, the package that I linked finds the best transformation using cross-validation, which is kind of the right thing to do statistically.

The transformation has to be applied consistently across all data (and care has to be taken if there are things like outcome constraints or relative effects). FWIW, we do some data transformation automatically in Ax (such as standardizing or winsorizing), but we don't automatically apply log transforms or other kinds of transforms (this is something we could do more of).

Yes, I am aware that the same transformation needs to be applied to all data.
I have actually carefully inspected the source code and the documentation; I do not like running code that I do not understand at least at some level. For example, I carefully studied what is done automatically, as I was initially surprised that some recommended normalizations had no effect (the registry entries below show which transforms are applied by default):

    "GPEI": ModelSetup(
        bridge_class=TorchModelBridge,
        model_class=BotorchModel,
        transforms=Cont_X_trans + Y_trans,
        standard_bridge_kwargs=STANDARD_TORCH_BRIDGE_KWARGS,
    ),
    "Sobol": ModelSetup(
        bridge_class=RandomModelBridge,
        model_class=SobolGenerator,
        transforms=Cont_X_trans,
    ),

If the form of your data transform doesn't change, you'll be safe. One option would be to collect a bunch of observations (or use those that you already have), compute the "best normalizing transform" based on those, create a new Ax experiment where you add these results as manual arms, and then use the same transform for all subsequent results obtained on any new trials suggested by Ax.

Actually, I had already coded this before I saw your post; I am just waiting until I have enough samples. We already use the manual workflow extensively.

Moreover, I am adding diagnostics: quality of fit, variable importance, etc., i.e. the things described at https://ax.dev/docs/models.html and https://ax.dev/versions/0.1.5/tutorials/visualizations.html.
I will report back on how it goes, including the transformations.
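
A rough sketch of the kind of diagnostics I mean, following that tutorial (function names may differ slightly between Ax versions; `model` is a fitted ModelBridge, e.g. from Models.GPEI):

    from ax.modelbridge.cross_validation import cross_validate
    from ax.plot.diagnostic import interact_cross_validation
    from ax.plot.feature_importances import plot_feature_importance_by_feature
    from ax.utils.notebook.plotting import render

    # Out-of-sample quality of fit via cross-validation of the fitted model.
    cv_results = cross_validate(model)
    render(interact_cross_validation(cv_results))

    # Per-parameter sensitivities ("variable importance").
    render(plot_feature_importance_by_feature(model))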

I am actually glad that I asked this question, as I had started going into too much detail (like reading the BoTorch source of qNoisyExpectedImprovement and the paper to see whether it makes sense to change it) before having a proper basic analysis in place. I can get very excited.

Thank you for chatting with me.


Balandat commented on April 28, 2024

the transformation is refitted every time new data comes in, is this correct? I guess this is fine since the transformations are invertible, but it means that the kernel matrix is recomputed on every update.

Yes, that is correct. The Ax paradigm generally is that it handles both transforming data before passing it to the lower-level modeling & optimization layer and un-transforming the predictions returned by that layer. And yes, we usually construct a new model upon generating new points rather than updating an existing one (unless there is no new data, in which case we reuse the old one). I guess it would be possible to try to avoid that and cache some kernel matrices etc., but that would get quite messy really fast with all the complexities of transforms and so on. So our design choice here is to re-fit the transforms every time there is new data, which is generally pretty low overhead relative to the actual GP fitting and acquisition function optimization.
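
In code, the typical Developer API loop looks roughly like this; note that every iteration constructs a fresh model (and re-fits the transforms) on all data gathered so far. This is just a sketch and assumes an `experiment` with a runner attached:

    from ax.modelbridge.registry import Models

    for _ in range(20):
        # Re-fit transforms + GP on everything observed so far, then generate.
        model = Models.GPEI(experiment=experiment, data=experiment.fetch_data())
        generator_run = model.gen(n=1)
        trial = experiment.new_trial(generator_run=generator_run)
        trial.run()
        trial.mark_completed()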

I would just like to make sure that I am doing the right thing. The easiest thing for me would be to apply Winsorize (though one obviously cannot invert that one) and PowerTransformY, which are already available; for this I need to pass the configs for these transformations, and maybe there is an example of how to do this with ModelSetup. I have the same question again: it seems that the transformations are refitted when new data comes in (refit_on_update=True). Is this technically correct, or should I use my own transforms as I am doing now?

Yes, this should be fine; this setup gets you around shifting the data to positive values outside of Ax with a global offset (though you could still do that even with the transforms proposed above, as long as you remember to account for that shift at the end). Winsorization works quite well in practice and we use it a lot ourselves, so this seems like a solid approach.
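
For the config question, here is a hypothetical sketch of wiring these transforms up via a GenerationStrategy instead of editing ModelSetup directly; the exact Winsorize config keys differ between Ax versions, so the values below are placeholders rather than the definitive API:

    from ax.modelbridge.generation_strategy import GenerationStep, GenerationStrategy
    from ax.modelbridge.registry import Cont_X_trans, Models
    from ax.modelbridge.transforms.power_transform_y import PowerTransformY
    from ax.modelbridge.transforms.winsorize import Winsorize

    gs = GenerationStrategy(
        steps=[
            GenerationStep(model=Models.SOBOL, num_trials=8),
            GenerationStep(
                model=Models.GPEI,
                num_trials=-1,  # run this step until the experiment ends
                model_kwargs={
                    # Winsorize first, then apply the power transform to the outcomes.
                    "transforms": Cont_X_trans + [Winsorize, PowerTransformY],
                    # Placeholder config; check the Winsorize signature of your Ax version.
                    "transform_configs": {
                        "Winsorize": {"winsorization_upper": 0.2},
                    },
                },
            ),
        ]
    )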

I also started playing with https://research.facebook.com/blog/2021/7/high-dimensional-bayesian-optimization-with-sparsity-inducing-priors/ and https://twitter.com/SebastianAment/status/1720440669628514763
The log acquisition ones seem to work better than before, but it did not make a huge difference. I still need to do a proper test with SAASBO.

Great, happy to see you test this out, let us know how this goes!
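
For reference, a rough sketch of picking these up from the registry (assumes an existing `experiment` and a BoTorch version that ships the LogEI family):

    from botorch.acquisition.logei import qLogNoisyExpectedImprovement
    from ax.modelbridge.registry import Models

    data = experiment.fetch_data()

    # Modular BoTorch model with the Log-variant acquisition function.
    logei_model = Models.BOTORCH_MODULAR(
        experiment=experiment,
        data=data,
        botorch_acqf_class=qLogNoisyExpectedImprovement,
    )

    # Fully Bayesian SAAS model: slower to fit, often stronger in high dimensions.
    saasbo_model = Models.SAASBO(experiment=experiment, data=data)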

FWIW there is quite a bit of work on robust GP(-like) regression, e.g. with student-t processes or using ideas from generalized Bayesian inference. This is something that we hope to support in the future but don't have any concrete plans for at this point.


andrejmuhic commented on April 28, 2024

Hi, Balandat
I did some more inspection; it looks like a FixedNoiseGP is used, so providing our approximate measurement noise (std) would also be a better option. I have also read a bit on Student-t processes, which are intriguing. I have done robust regression by specifying only the noise part in the past, even with Student-t-like errors in PyMC3. I guess that would be an easier first option here, without implementing the whole Student-t process machinery; sadly there is not enough time, and I do not see enough potential gain to justify even trying to implement it on my own. It seems that I am approaching diminishing returns quickly here.
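
A minimal sketch of reporting a known noise level through the Service API (the metric name "objective" and `run_evaluation` are placeholders); when a SEM is supplied, the noise can be treated as observed rather than inferred:

    from ax.service.ax_client import AxClient

    ax_client = AxClient()
    # ... ax_client.create_experiment(...) set up elsewhere ...

    parameters, trial_index = ax_client.get_next_trial()
    mean, sem = run_evaluation(parameters)  # placeholder for the actual evaluation
    # Reporting (mean, sem) instead of a bare mean passes the measured noise to the GP.
    ax_client.complete_trial(trial_index=trial_index, raw_data={"objective": (mean, sem)})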

One more important thing: what happens if one runs concurrent trials and generates the next trial when there is no new data for the GP?
It is hard to dig out of the internals whether a Sobol step is done in that case, or whether there is some kind of restricted sampling in the acquisition part to prevent choosing the same candidate again.
The message below makes sense, but it is hard to see what actually happens:
ax.modelbridge.torch: The observations are identical to the last set of observations used to fit the model. Skipping model fitting.

Thank you for your help. I am getting better and better insight.


andrejmuhic commented on April 28, 2024

Hi, Balandat

What happens if one runs concurrent trials and generates the next trial when there is no new data for the GP?

We take those trials into account as "pending points" when computing the acquisition value. Basically, we take the expectation of the acquisition value at some set of points X over the model posterior at the pending points X_pending. This typically ensures that we're not generating candidates identical to those already running.

This makes perfect sense.
Yes, I am using parallel trials and providing pending_observations, either automatically or manually.
The message just confused me somewhat. Maybe instead of removing the message it would be better to make it more explicit, as I think you already suggested: if pending_observations are provided, this is OK and expected; on the other hand, if pending_observations are not provided and one generates manually, this will produce exactly the same candidate as before, which is almost never desired.
Like in https://ax.dev/tutorials/generation_strategy.html.


Balandat commented on April 28, 2024

Hi @andrejmuhic, my first instinct would be to do some kind of log transform of the observations to squish the distribution together and render the problem easier to model (note that this wouldn't change the location of the optimum due to the monotonicity of the transform).

The issue is of course that you have negative values - do you have some kind of global lower bound L on the function values f that you know a priori? Then you could just transform your values y as y_tf = log(y - L) and try optimizing this.

If you don't know this global value a priori, then one could in principle try to estimate this in each iteration from the model (e.g. as the minimum observed value minus some interquantile range to avoid squishing things too much near zero), but afaik we don't have this hooked up in Ax at this point - so if you can have some reasonable guess of a global L (doesn't have to be perfect) that would be the easiest approach.
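
A minimal sketch of that shift-and-log transform and its inverse (the bound is just an example value):

    import numpy as np

    L = -100.0  # known (or conservatively guessed) global lower bound on f

    def to_log_space(y):
        # Monotone, so the location of the optimum is unchanged.
        return np.log(y - L)

    def from_log_space(y_tf):
        # Inverse map, e.g. for reading model predictions on the original scale.
        return np.exp(y_tf) + L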


andrejmuhic commented on April 28, 2024

Hi, thank you for your advice.
We will try it out and I will report back on how it goes.
We can probably come up with a global bound.
So after shifting, I guess any of this stuff goes:
https://petersonr.github.io/bestNormalize/
I did similar things in the past when normalizing data.


Balandat commented on April 28, 2024

So after shifting I guess any of this stuff goes: https://petersonr.github.io/bestNormalize

In general, yes, though you have to be careful to make sure that the transform is consistent for all the data. Like, you can't just find the "best normalizing transform" for data of trials 1-10, pass that transformed data to Ax, and then compute another "best normalizing transform" based on data of trials 1-20 but only pass the transformed outcomes of trials 11-20 to Ax.

The transformation has to be applied consistently across all data (and care has to be taken if there are things like outcome constraints or relative effects). FWIW, we do some data transformation automatically in Ax (such as standardizing or winsorizing), but we don't automatically apply log transforms or other kinds of transforms (this is something we could do more of).

If the form of your data transform doesn't change, you'll be safe. One option would be to collect a bunch of observations (or use those that you already have), compute the "best normalizing transform" based on those, create a new Ax experiment where you add these results as manual arms, and then use the same transform for all subsequent results obtained on any new trials suggested by Ax.
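
One way to implement that outside of Ax is to fit the normalizing transform once on the initial observations and then reuse it verbatim; a sketch using scikit-learn's PowerTransformer as a stand-in for the "best normalizing transform" (`initial_outcomes` is a placeholder for the data already collected):

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    # Fit ONCE on the initial batch of (already shifted) outcomes ...
    pt = PowerTransformer(method="yeo-johnson")
    pt.fit(np.asarray(initial_outcomes).reshape(-1, 1))

    # ... then apply the same fitted transform to every outcome reported to Ax,
    # and invert it when mapping model predictions back to the original scale.
    def transform_outcome(y):
        return float(pt.transform([[y]])[0, 0])

    def untransform_outcome(y_tf):
        return float(pt.inverse_transform([[y_tf]])[0, 0])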


andrejmuhic commented on April 28, 2024

Hi @Balandat,

It took me some time.
I have experimented with things quite a bit. In the end I did not like that R package that much; the quantile transformation that it chooses does not generalize to extreme values, of course.
So actually, the most benefit according to:

from ax.modelbridge.cross_validation import compute_diagnostics, cross_validate

# Cross-validate the fitted model bridge and summarize the quality of fit.
cv = cross_validate(model)
diagnostics = compute_diagnostics(cv)

and also a final test on validation data is obtained by doing winsorization as in Winsorize, with some additional benefit from PowerTransformY.
I had to hack up my own transformations so that I could do proper out-of-sample transformations.
I actually have a question here: how does StandardizeY, etc. work? Maybe I am wrong, but it seems to me that the transformation is refitted every time new data comes in, is this correct? I guess this is fine since the transformations are invertible, but it means that the kernel matrix is recomputed on every update.

I am not sure how cross_validate is done; I tried reading the source code, but this could cause data peeking in some sense if the model is not refitted for every fold?
OK, I found it just now: the default is refit_on_cv=False.

I would just like to make sure that I am doing the right thing. The easiest thing for me would be to apply Winsorize (though one obviously cannot invert that one) and PowerTransformY, which are already available; for this I need to pass the configs for these transformations, and maybe there is an example of how to do this with ModelSetup. I have the same question again: it seems that the transformations are refitted when new data comes in (refit_on_update=True). Is this technically correct, or should I use my own transforms as I am doing now?

I also started playing with https://research.facebook.com/blog/2021/7/high-dimensional-bayesian-optimization-with-sparsity-inducing-priors/ and https://twitter.com/SebastianAment/status/1720440669628514763
The log acquisition ones seem to work better than before, but it did not make a huge difference. I still need to do a proper test with SAASBO.

Thank you for your help, keep up the good work, I enjoyed reading the papers.


Balandat commented on April 28, 2024

What happens if one runs concurrent trials and generates the next trial when there is no new data for the GP?

We take those trials into account as "pending points" when computing the acquisition value. Basically, we take the expectation of the acquisition value at some set of points X over the model posterior at the pending points X_pending. This typically ensures that we're not generating candidates identical to those already running.
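
Roughly what this looks like one layer down in BoTorch (a sketch; `model`, `train_X`, and `pending_X` are assumed to exist already, and Ax fills in the pending points automatically when pending_observations are passed):

    from botorch.acquisition.monte_carlo import qNoisyExpectedImprovement

    # Candidates are scored jointly with the still-running pending points, which
    # discourages the optimizer from proposing (near-)duplicates of them.
    acqf = qNoisyExpectedImprovement(
        model=model,          # GP fitted on the observed data
        X_baseline=train_X,   # observed inputs
        X_pending=pending_X,  # inputs of trials that are still running
    )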

The message below makes sense, but it is hard to see what actually happens:

ax.modelbridge.torch: The observations are identical to the last set of observations used to fit the model. Skipping model fitting.

Yeah, if you're working with pending points then this is expected. I guess we could consider reducing the logging of this in contexts where it is expected, but it's probably not a bad diagnostic to have (cc @saitcakmak).

