Comments (4)
Hi @CCranney, thanks for your interest!
Fortunately, we added a Submitit tutorial for submitting jobs to slurm just the other day: https://github.com/facebook/Ax/blob/main/tutorials/submitit.ipynb
However, I cannot see where those kinds of error messages are ever created, and therefore have difficulty debugging the process. How would I see what errors caused the trials to fail?
In the PyTorch MOO NAS tutorial, the jobs run locally via TorchX, each in its own process, which makes debugging quite hard. I think there should be a way to set things up in TorchX to pipe the logs back to the TorchXRunner, or at least save them to disk, but I'm not sure how to do that (if you figure it out, please let us know; I think this is a question for the TorchX folks).
What changes would need to occur in a Runner object to be submitted as a batch job?
I haven't tried the TorchXRunner with remote job submission yet (we use a different kind of remote backend), so unfortunately I don't know the answer here. I would recommend starting from TorchX and checking whether you can run a slurm job on the cluster with pure TorchX code. That should tell you which settings need to be piped through from the runner to the respective TorchX logic. This might require some minor updates to the runner; we'd be happy to help with those.
But if you're fine with using Submitit, I recommend you give the tutorial I mentioned above a try.
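For readers who want a feel for the Submitit route before opening the tutorial, here is a minimal sketch of submitting a single trial as a Slurm job. It is not taken from the tutorial: `train_and_evaluate`, `submit_trial`, the `submitit_logs` folder name, and the resource values (`timeout_min=30`, `slurm_partition="dev"`, `cpus_per_task=2`) are all hypothetical placeholders that would need to be adapted to your cluster.

```python
def train_and_evaluate(learning_rate: float, hidden_size: int) -> float:
    """Hypothetical stand-in for the real training script; returns a made-up score."""
    return 1.0 / (1.0 + abs(learning_rate - 0.01) * hidden_size)


def submit_trial(log_folder: str = "submitit_logs", **params):
    """Submit one trial as a Slurm job and return the submitit Job handle."""
    # Imported lazily so this module still loads where submitit is not installed.
    import submitit

    # Submitit stores each job's pickled function and its stdout/stderr in this folder.
    executor = submitit.AutoExecutor(folder=log_folder)
    # Placeholder resources (assumptions); replace with values valid on your cluster.
    executor.update_parameters(timeout_min=30, slurm_partition="dev", cpus_per_task=2)
    return executor.submit(train_and_evaluate, **params)
```

On a machine with Slurm available, `job = submit_trial(learning_rate=0.01, hidden_size=64)` returns a job handle; `job.result()` blocks until the job finishes and raises if it failed, and `job.stderr()` returns the captured error output, which helps with the debugging question above.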
from ax.
Thank you @Balandat! This definitely set me off in the right direction. I'll be digging into the TorchXRunner question further, and if I get those answers I'll definitely reach out. Thank you for the tutorial and for the tips! I'll go ahead and close this issue.
Great. Please do report back and share your solution :)
Hi all,
While I have not settled on the best method for deploying jobs to the scheduler, I have landed on a debugging solution for investigating specific trial runs. My code is split into two files. The first uses Ax classes to set up the experiment, trials, runner, and scheduler. It relies on a second file that uses pytorch-lightning to generate, train, and evaluate the models specified by the search space. All of this is outlined in the tutorial link in my first post.
Because errors raised in the second script are not printed to the screen/terminal, I found a way to log all of that script's output to a text file. It's a workaround, but it works for identifying how or why specific trials failed.
I pasted the following code at the top of the second script. Note that I manually created a log directory (logs/ in the code below) for it to save files to prior to running the code.
import io
import logging
import sys
from datetime import datetime


class StreamToLogger(io.TextIOBase):
    """File-like object that forwards each written line to a logger."""

    def __init__(self, logger, level=logging.INFO):
        self.logger = logger
        self.level = level

    def write(self, buf):
        # Log every non-empty chunk line by line, dropping trailing whitespace
        for line in buf.rstrip().splitlines():
            self.logger.log(self.level, line.rstrip())
        # File-like write() is expected to return the number of characters written
        return len(buf)


# Configure the logging module (the logs/ directory must already exist)
logging.basicConfig(filename=f'logs/output_{datetime.now().strftime("%Y%m%d-%H%M%S-%f")}.log', level=logging.INFO)

# Redirect stdout to the logger
stdout_logger = logging.getLogger('STDOUT')
sys.stdout = StreamToLogger(stdout_logger, logging.INFO)

# Redirect stderr to the logger
stderr_logger = logging.getLogger('STDERR')
sys.stderr = StreamToLogger(stderr_logger, logging.ERROR)
The saved files are marked by the date and time down to microseconds (to ensure that even rapidly-generated jobs will not accidentally write to the same file). This may be overkill, but better too much info than not enough in my opinion.
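A variant of the same idea, sketched here with only the standard library, creates the log directory automatically and limits the redirection to a `with` block rather than replacing `sys.stdout` for the whole process. The helper name `capture_trial_output` is hypothetical, and like the approach above it only captures output written through Python's `sys.stdout` and `sys.stderr`, not output written directly by subprocesses or C extensions.

```python
import contextlib
import traceback
from datetime import datetime
from pathlib import Path


@contextlib.contextmanager
def capture_trial_output(log_dir="logs"):
    """Send everything printed inside the block to a timestamped log file."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)  # no manual directory setup needed
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S-%f")
    log_path = log_dir / f"output_{stamp}.log"
    with open(log_path, "a", encoding="utf-8") as log_file:
        with contextlib.redirect_stdout(log_file), contextlib.redirect_stderr(log_file):
            try:
                yield log_path
            except Exception:
                # Write the traceback to the log, then re-raise so the trial still fails.
                traceback.print_exc(file=log_file)
                raise
```

Wrapping the body of the training script in `with capture_trial_output() as log_path:` keeps one log file per run, and an uncaught exception leaves its full traceback in that file before propagating.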