Comments (4)

NormanTUD commented on April 28, 2024

https://github.com/NormanTUD/OmniOpt/tree/main/ax

Main script:

https://github.com/NormanTUD/OmniOpt/blob/main/ax/.omniopt.py

For anyone looking into the environment where the problem appears: my general plan is to allow a call like this:

./omniopt --partition=alpha --experiment_name=example --mem_gb=1 --time=60 --worker_timeout=60 --max_eval=500 --num_parallel_jobs=500 --gpus=1 --follow --run_program=ZWNobyAiUkVTVUxUOiAlKHBhcmFtKSI= --parameter param range 0 1000 float

and to run that optimization on our clusters, using ax/botorch internally for hyperparameter optimization. As a university, we have practically unlimited resources for free and want to run as many workers in parallel as possible, to get the most out of the HPC system when searching for good hyperparameters for any type of problem, or just when researching those areas (depending on what your program does).
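The --run_program value in the command above appears to be base64-encoded. As a quick check, here is a minimal sketch using only Python's standard library that decodes it to show the command each worker would run. That %(param) is a placeholder filled in with the sampled parameter value is an assumption based on the --parameter param argument; the thread does not confirm it.

```python
import base64

# Value of --run_program copied from the command line above.
encoded = "ZWNobyAiUkVTVUxUOiAlKHBhcmFtKSI="

# Decode the base64 payload back into the shell command text.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)
```

This prints echo "RESULT: %(param)", so each job apparently just echoes its parameter back as the result.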

At the top of the code is a large comment listing some things I tried, though the list is far from complete.

We would really appreciate your help with this.

Yours sincerely

NormanTUD

from ax.

mgarrard commented on April 28, 2024

Hi @NormanTUD! Thanks so much for engaging with our tool - happy to help. Could you provide the logs from AxClient for your experiment? These logs usually contain information about the trial generation and generation strategy that will be helpful for us debugging the issue.
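One possible way to collect those logs is to attach a file handler to Python's "ax" logger hierarchy. This is a sketch under assumptions, not an official Ax recipe: it relies on Ax loggers being named under "ax" (as the ax.models.torch.botorch_modular.acquisition message later in this thread suggests), and attach_ax_log_file and the log path are hypothetical names chosen for illustration. Individual Ax loggers may set their own levels, so DEBUG messages might still be filtered out.

```python
import logging


def attach_ax_log_file(path: str, level: int = logging.DEBUG) -> logging.Handler:
    """Write records logged under the "ax" logger hierarchy to a file.

    Hypothetical helper: assumes Ax modules log via loggers named "ax.<module>".
    """
    handler = logging.FileHandler(path, mode="a", encoding="utf-8")
    handler.setLevel(level)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
    )
    ax_logger = logging.getLogger("ax")
    ax_logger.setLevel(level)
    # Child loggers such as "ax.service.ax_client" propagate up to "ax".
    ax_logger.addHandler(handler)
    return handler
```

Calling attach_ax_log_file("ax_debug.log") before creating the AxClient should then leave a file that can be attached to the issue.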

Also, good catch on "use_batch_trials" not having an effect. This code hasn't been open-sourced yet (hopefully soon!), so it isn't doing anything at this time. Let me raise an error to make that clearer.

mgarrard commented on April 28, 2024

@NormanTUD -- added a PR that raises an error when use_batch_trials is used; it'll be live once we cut a new release :)

Let me know if you have the logs from AxClient for additional support. Thanks!

NormanTUD commented on April 28, 2024

Hi,

thanks for your reply. I was on vacation and didn't code anything in the meantime, but I am now collecting all the logs. Thanks for your patience. I will update this post when I have them.

First a bit of my own debugging code:

Update #1:

trial_index_to_param, _ = ax_client.get_next_trials(
    max_trials=1
)

print_debug(f"Got {len(trial_index_to_param.items())} new items (m = {m}, in range(0, {calculated_max_trials})).")

These lines only run when new jobs need to be generated. For further testing, I keep max_trials=1 and call this inside a for loop, once per new job, instead of setting max_trials= to the number of new trials. But sometimes I get this:

2024-03-26 11:14:13: Got 0 new items (m = 0, in range(0, 33)).

So it just returns 0 jobs.
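To narrow down where the empty results come from, the loop could count empty calls and keep the second return value, which the snippet above discards with _. The following is a minimal sketch, not the OmniOpt code: request_trials is a hypothetical helper, and FakeClient is a hypothetical stand-in that only mimics the (dict, bool) return shape of get_next_trials so the example runs without Ax installed. As far as I understand the AxClient API, the boolean signals that optimization is complete.

```python
from typing import Any, Dict, Tuple


def request_trials(ax_client: Any, wanted: int) -> Tuple[Dict[int, dict], int, bool]:
    """Ask for `wanted` trials one at a time and count the empty responses."""
    trials: Dict[int, dict] = {}
    empty_calls = 0
    done = False
    for _ in range(wanted):
        new_trials, done = ax_client.get_next_trials(max_trials=1)
        if not new_trials:
            empty_calls += 1
        trials.update(new_trials)
        if done:
            # The client reported that no further trials can be generated.
            break
    return trials, empty_calls, done


class FakeClient:
    """Hypothetical stand-in for AxClient that hands out at most `limit` trials."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.issued = 0

    def get_next_trials(self, max_trials: int):
        if self.issued >= self.limit:
            return {}, False
        index = self.issued
        self.issued += 1
        return {index: {"param": float(index)}}, False


trials, empty_calls, done = request_trials(FakeClient(limit=3), wanted=5)
print(f"got {len(trials)} trials, {empty_calls} empty calls, done={done}")
```

With a real AxClient, logging empty_calls and done next to the output of ax_client.get_max_parallelism() might show whether a parallelism limit in the generation strategy is holding trials back; that is a guess, not a confirmed diagnosis.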

These are the number of workers over time:

17
7
5
8

(No timestamps are given there, though; each number comes from one pass of the generation loop.)

It should be around 20, so 17 is fine for a snapshot taken while the jobs are starting, but over time it drops much lower.

The only message from ax that seems relevant is this:

ax.models.torch.botorch_modular.acquisition: 
Encountered Xs pending for some Surrogates but observed for others. Considering 
these points to be pending.
