Comments (20)
It's not possible at the moment, because torch.optim.LBFGS has a different .step() method from most popular optimizers (Adam, SGD, etc.). It requires a closure argument, while for others it's optional.
torch.optim.LBFGS can be easily supported by refactoring neurodiffeq.solvers.BaseSolver. We'll add support for it in the near future.
from neurodiffeq.
If you need it right now, you can implement that and contribute to neurodiffeq. We would greatly appreciate it.
If you are not sure how to do that, we'll let you know once we add the new feature.
from neurodiffeq.
@shuheng-liu Thank you for your reply. I know that torch.optim.LBFGS requires a closure function, so I am trying to add a closure function to the _run_epoch function in BaseSolver in order to support LBFGS optimization.
In the _run_epoch function, the optimization step is performed after all batches. If I implement the closure function as below, the optimization step only updates the parameters according to the loss of the last batch. Shouldn't it update for every training sample?
# perform optimization step when training
if key == 'train':
    def closure():
        return loss
    self.optimizer.step(closure)
    # self._do_optimizer_step()
    self.optimizer.zero_grad()
from neurodiffeq.
Unfortunately, I'm not familiar with LBFGS either. But according to this post on the PyTorch forum and the documentation, it seems you should include more steps in closure.
Instead of simply returning the loss, you need to include these steps:
- zero the gradient
- pass input (sampled points) to the model and obtain the output (function values)
- calculate the loss (by calling the differential function)
- call loss.backward()
- return the loss tensor
I guess returning the same loss value won't work because LBFGS evaluates the loss function several times within a single step.
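The steps listed above can be sketched on a toy problem. This is only a minimal illustration, not neurodiffeq code: the linear model, the data, and the use of `line_search_fn="strong_wolfe"` are assumptions chosen to keep the example small and stable.

```python
import torch

torch.manual_seed(0)

# Toy problem (hypothetical): fit y = 2x + 1 with a single linear layer.
x = torch.linspace(-1.0, 1.0, 32).unsqueeze(1)
y = 2.0 * x + 1.0
model = torch.nn.Linear(1, 1)
optimizer = torch.optim.LBFGS(
    model.parameters(), max_iter=20, line_search_fn="strong_wolfe"
)

def closure():
    optimizer.zero_grad()                           # zero the gradient
    output = model(x)                               # forward pass on the sampled points
    loss = torch.nn.functional.mse_loss(output, y)  # compute the loss
    loss.backward()                                 # backpropagate
    return loss                                     # return the loss tensor

initial_loss = closure().item()
optimizer.step(closure)  # L-BFGS may call closure several times in here
final_loss = closure().item()
print(initial_loss, final_loss)
```

Because the closure recomputes the forward and backward pass on every call, the optimizer can re-evaluate the objective at each trial point it visits.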
I need more time to look into LBFGS. It's Chinese New Year these days, so there may be some delay. Apologies in advance.
from neurodiffeq.
Thank you so much. I have read the LBFGS documentation of PyTorch, but implementing this in neurodiffeq has some subtle points.
I look forward to hearing from you.
Happy Chinese New Year.
from neurodiffeq.
I have implemented two modifications of the _run_epoch function of the BaseSolver class. In the first one, the losses of every batch are stored in a list, and after all batches the optimization step is executed (if the phase is training) with a closure function that returns the mean of the losses.
In the second one, the optimization step is performed in each batch if the phase is 'train', and the closure function includes all steps of computing the loss.
Both of these are implemented to support the LBFGS optimization method.
The first one:
def _run_epoch(self, key):
    r"""Run an epoch on train/valid points, update history, and perform an optimization step if key=='train'.
    :param key: {'train', 'valid'}; phase of the epoch
    :type key: str
    .. note::
        The optimization step is only performed after all batches are run.
    """
    self._phase = key
    epoch_loss = 0.0
    metric_values = {name: 0.0 for name in self.metrics_fn}
    losses = []
    # perform forward pass for all batches: a single graph is created and released in every iteration
    # see https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/17
    for batch_id in range(self.n_batches[key]):
        batch = self._generate_batch(key)
        funcs = [
            self.compute_func_val(n, c, *batch) for n, c in zip(self.nets, self.conditions)
        ]
        for name in self.metrics_fn:
            value = self.metrics_fn[name](*funcs, *batch).item()
            metric_values[name] += value
        residuals = self.diff_eqs(*funcs, *batch)
        residuals = torch.cat(residuals, dim=1)
        loss = self.criterion(residuals) + self.additional_loss(funcs, key)
        # normalize loss across batches
        loss /= self.n_batches[key]
        losses.append(loss)
        # accumulate gradients before the current graph is collected as garbage
        if key == 'train':
            loss.backward()
        epoch_loss += loss.item()
    # calculate mean loss of all batches and register to history
    self._update_history(epoch_loss, 'loss', key)
    # perform optimization step when training
    if key == 'train':
        # def closure():
        #     return torch.stack(losses, dim=0).mean(dim=0)
        # self.optimizer.step(closure)
        self._do_optimizer_step()
        self.optimizer.zero_grad()
    # update lowest_loss and best_net when validating
    else:
        self._update_best()
    # calculate average metrics across batches and register to history
    for name in self.metrics_fn:
        self._update_history(
            metric_values[name] / self.n_batches[key], name, key)
and the second one:
def _run_epoch(self, key):
    r"""Run an epoch on train/valid points, update history, and perform an optimization step if key=='train'.
    :param key: {'train', 'valid'}; phase of the epoch
    :type key: str
    .. note::
        The optimization step is only performed after all batches are run.
    """
    self._phase = key
    epoch_loss = 0.0
    metric_values = {name: 0.0 for name in self.metrics_fn}
    # perform forward pass for all batches: a single graph is created and released in every iteration
    # see https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/17
    for batch_id in range(self.n_batches[key]):
        def closure():
            nonlocal epoch_loss
            batch = self._generate_batch(key)
            funcs = [
                self.compute_func_val(n, c, *batch) for n, c in zip(self.nets, self.conditions)
            ]
            for name in self.metrics_fn:
                value = self.metrics_fn[name](*funcs, *batch).item()
                metric_values[name] += value
            residuals = self.diff_eqs(*funcs, *batch)
            residuals = torch.cat(residuals, dim=1)
            loss = self.criterion(residuals) + \
                self.additional_loss(funcs, key)
            # normalize loss across batches
            loss /= self.n_batches[key]
            epoch_loss += loss.item()
            # accumulate gradients before the current graph is collected as garbage
            if key == 'train':
                self.optimizer.zero_grad()
                loss.backward()
            return loss
        from torch.optim import LBFGS
        if key == 'train':
            if isinstance(self.optimizer, LBFGS):
                self.optimizer.step(closure)
            else:
                closure()
                self._do_optimizer_step()
        else:
            closure()
    # calculate mean loss of all batches and register to history
    self._update_history(epoch_loss, 'loss', key)
    # perform optimization step when training
    # update lowest_loss and best_net when validating
    if key != 'train':
        self._update_best()
    # calculate average metrics across batches and register to history
    for name in self.metrics_fn:
        self._update_history(
            metric_values[name] / self.n_batches[key], name, key)
I appreciate anyone's time to comment on these.
Thanks.
from neurodiffeq.
Thanks for your ideas! However, there's a subtlety here.
Because training points are randomly sampled every time, there isn't much difference between an epoch and a batch. We still distinguish between batch and epoch because we can split a large number of samples (which don't fit into memory) into multiple batches. We feed each batch to the model and accumulate the gradient. Finally, when all batches have been fed, we perform a single optimization step based on the accumulated gradient. Then we call it an epoch.
Let's say we want to train on 10k points before performing an optimizer step, but our GPU/CPU memory only allows 5k points at a time. So we split them into two batches of 5k samples, batch1 and batch2. To show the execution, I expand the for-loop into two consecutive blocks.
# first loop iteration
u = model(batch1) # create a graph
loss = pde(u, batch1) # the graph gets larger, occupying nearly all memory
loss.backward() # compute gradients of loss w.r.t. model parameters
# second loop iteration
u = model(batch2) # create a second graph
loss = pde(u, batch2) # NOTICE: `loss` from previous iteration is overwritten. The first graph gets garbage-collected, freeing memory
loss.backward() # accumulate gradients of new loss w.r.t. model parameters
# outside the loop
optimizer.step()
optimizer.zero_grad()
Note that, if we keep losses in a list, then each item in losses will require the whole graph to be maintained in memory. The process becomes:
losses = []
# first loop iteration
u = model(batch1)
loss = pde(u, batch1)
loss.backward()
losses.append(loss)
# second loop iteration
u = model(batch2)
loss = pde(u, batch2) # NOTICE: `loss` is overwritten, but `losses` contains the previous loss. No garbage collection will be performed.
loss.backward()
losses.append(loss)
# outside the loop
optimizer.step()
optimizer.zero_grad()
In this case, training on several batches will require the same amount of memory as training on a concatenated batch.
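A memory-friendly alternative to keeping loss tensors is to store only their float values. The sketch below is a toy illustration under stated assumptions: `pde` is a hypothetical stand-in for the real residual loss, and the model and batch sizes are arbitrary. It also checks that accumulating gradients over two equal-sized batches matches the gradient on the concatenated batch.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(1, 1)
batch1 = torch.rand(5, 1)
batch2 = torch.rand(5, 1)

def pde(u, x):
    # hypothetical stand-in for a PDE residual loss
    return ((u - x) ** 2).mean()

# Accumulate gradients batch by batch, keeping only Python floats.
model.zero_grad()
epoch_loss = 0.0
for batch in (batch1, batch2):
    loss = pde(model(batch), batch) / 2  # normalize by the number of batches
    loss.backward()                      # gradients are summed into .grad
    epoch_loss += loss.item()            # a float holds no reference to the graph
accumulated = [p.grad.clone() for p in model.parameters()]

# Reference: gradient computed on the concatenated batch in one pass.
model.zero_grad()
full = torch.cat([batch1, batch2])
pde(model(full), full).backward()
reference = [p.grad.clone() for p in model.parameters()]

print(all(torch.allclose(a, r, atol=1e-6) for a, r in zip(accumulated, reference)))
```

Since `loss.item()` returns a plain number, nothing in `epoch_loss` prevents the graph of a finished batch from being freed.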
from neurodiffeq.
The second way seems promising, although I'm not sure how the nonlocal modifier behaves. I'll look into that later.
from neurodiffeq.
Are you sure that gradients are accumulated when the backward function is called?
The nonlocal modifier allows us to modify the epoch_loss variable inside the closure function.
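For readers unfamiliar with nonlocal, here is a small pure-Python sketch of the behavior described above. The function name `run_batches` and the sample values are made up for illustration.

```python
def run_batches(batch_losses):
    epoch_loss = 0.0

    def closure(value):
        # Without `nonlocal`, the augmented assignment below would try to
        # create a new local variable and raise UnboundLocalError.
        nonlocal epoch_loss
        epoch_loss += value
        return value

    for value in batch_losses:
        closure(value)
    return epoch_loss

total = run_batches([0.5, 0.25, 0.25])
print(total)
```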
from neurodiffeq.
Thanks for explaining. And yes, I'm sure the gradients are accumulated (i.e. summed). This behavior is a consequence of PyTorch's reverse mode of automatic differentiation. Here's a simple example to try:
import torch
w = torch.tensor([1.0], requires_grad=True)
print(w.grad) # w.grad is None
x1 = torch.tensor([2.0])
x2 = torch.tensor([3.0])
u = w * x1
u.backward()
print(w.grad) # w.grad is 2.0, new memory space is allocated
v = w * x2
v.backward()
print(w.grad) # w.grad is 2.0 + 3.0 = 5.0, no extra memory is required
from neurodiffeq.
I've implemented the closure feature in a new branch, optimizer-closure, which should support torch.optim.LBFGS now. Can you try installing the library with
# you may have to uninstall your existing installation of neurodiffeq first
pip install git+https://github.com/odegym/neurodiffeq.git@optimizer-closure
and see if it works fine? If so, this will be part of a new feature in the v0.4.0 release.
from neurodiffeq.
Thank you for explaining the backward() function.
I tested the optimizer-closure branch on two examples, and for both of them the test-loss and valid-loss approached infinity. If possible, please send me a test example of yours that works correctly.
Also, two points came to my mind:
1. Why is epoch_loss aggregated and finally averaged? Shouldn't epoch_loss be the loss of the last run of closure?
2. According to the PyTorch docs about LBFGS, I think the training set should be the same in each execution of the closure function. In this modification, the training set changes (batch_generator) on every call of closure.
from neurodiffeq.
I seem to have made a mistake in my tests. The loss gradients are not used for optimization, which is really weird, and I need more time to debug. As for your questions:
- Epoch loss is the loss of all batches. As I previously explained, we only call it an epoch when all batches are run, so we divide each loss by the number of batches (n_batches[key]) before we add them up. In the meantime, are you certain that the last call to closure is what we need? I don't know why L-BFGS calls closure multiple times, so I assumed that each call serves the same purpose. Since closure is called multiple times with L-BFGS, I further average the loss over the number of times closure is called (closure_run_count).
- I didn't quite see the part saying that the training set should be the same for the closure function in this link. Can you point out where you found the instructions? I'm not familiar with L-BFGS and I guess you are probably right. In that case, we'll have to use a different training strategy for L-BFGS. The logic for L-BFGS will be different from other optimizers. Namely, we must run the optimizer step() for every batch instead of running it only once (after all batches are run).
from neurodiffeq.
I've read the PyTorch code of the L-BFGS class and figured out that L-BFGS tries to minimize the objective function's value (loss here) in each iteration, so the loss returned from the last call of closure is the real loss of the batch. In fact, adding closure_loss and returning it as the loss in closure is wrong.
According to the following code, the step function is called in each for iteration; thus, input and target are the same for each call of closure in one execution of the step function.
for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)
Finally, I am confident that the step function should be executed in every batch with the same training pair.
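To see how often L-BFGS evaluates closure within a single step() call, a small experiment can help. The sketch below uses a toy quadratic objective (an assumption for illustration, not a neurodiffeq problem) with fixed data, and records every loss value the closure computes.

```python
import torch

# Toy objective (hypothetical): minimize ||w - target||^2 on fixed data.
w = torch.zeros(2, requires_grad=True)
target = torch.tensor([3.0, -1.0])
optimizer = torch.optim.LBFGS([w], max_iter=20)
losses_seen = []

def closure():
    optimizer.zero_grad()
    loss = ((w - target) ** 2).sum()
    loss.backward()
    losses_seen.append(loss.item())  # record every evaluation
    return loss

optimizer.step(closure)  # a single step() call
n_calls = len(losses_seen)
print(n_calls, losses_seen[0], losses_seen[-1])
```

In this toy setting the closure runs more than once per step() and the recorded loss shrinks between the first and the last evaluation, which only makes sense if every evaluation sees the same data.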
from neurodiffeq.
I tested the code below on two systems of ODEs and got acceptable results.
def _run_epoch(self, key):
    r"""Run an epoch on train/valid points, update history, and perform an optimization step if key=='train'.
    :param key: {'train', 'valid'}; phase of the epoch
    :type key: str
    .. note::
        The optimization step is only performed after all batches are run.
    """
    self._phase = key
    epoch_loss = 0.0
    batch_loss = 0.0
    metric_values = {name: 0.0 for name in self.metrics_fn}
    # perform forward pass for all batches: a single graph is created and released in every iteration
    # see https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/17
    for batch_id in range(self.n_batches[key]):
        batch = self._generate_batch(key)
        def closure():
            nonlocal batch_loss
            if key == 'train':
                self.optimizer.zero_grad()
            funcs = [
                self.compute_func_val(n, c, *batch) for n, c in zip(self.nets, self.conditions)
            ]
            for name in self.metrics_fn:
                value = self.metrics_fn[name](*funcs, *batch).item()
                metric_values[name] += value
            residuals = self.diff_eqs(*funcs, *batch)
            residuals = torch.cat(residuals, dim=1)
            loss = self.criterion(residuals) + self.additional_loss(funcs, key)
            # accumulate gradients before the current graph is collected as garbage
            if key == 'train':
                loss.backward()
                batch_loss = loss.item()
            return loss
        if key == 'train':
            self._do_optimizer_step(closure=closure)
            epoch_loss += batch_loss
        else:
            epoch_loss += closure().item()
    # calculate mean loss of all batches and register to history
    self._update_history(epoch_loss / self.n_batches[key], 'loss', key)
    # update lowest_loss and best_net when validating
    if key == 'valid':
        self._update_best()
    # calculate average metrics across batches and register to history
    for name in self.metrics_fn:
        self._update_history(
            metric_values[name] / self.n_batches[key], name, key)

def _do_optimizer_step(self, closure=None):
    r"""Optimization procedures after gradients have been computed. Usually ``self.optimizer.step()`` is sufficient.
    At times, users can overwrite this method to perform gradient clipping, etc. Here is an example::
        import itertools
        class MySolver(Solver):
            def _do_optimizer_step(self, closure=None):
                nn.utils.clip_grad_norm_(itertools.chain.from_iterable([net.parameters() for net in self.nets]), 1.0, 'inf')
                self.optimizer.step(closure=closure)
    """
    return self.optimizer.step(closure=closure)
from neurodiffeq.
Thanks a lot! I'll make a small change so that, when using other optimizers (which don't require a closure argument in .step()), training is performed as before; i.e., the parameters are only updated after all batches are run.
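As a rough sketch of what such a two-path training epoch could look like (this is not the code that ended up in neurodiffeq): `run_train_epoch` is a hypothetical helper, the `isinstance` check is just one possible way to pick the path, and the toy data and `strong_wolfe` line search are assumptions for the demo.

```python
import torch

def run_train_epoch(model, optimizer, batches, loss_fn):
    """Hypothetical helper: one training epoch with two strategies."""
    if isinstance(optimizer, torch.optim.LBFGS):
        # Closure path: one optimizer step per batch, same batch on every call.
        epoch_loss = 0.0
        for x, y in batches:
            batch_loss = 0.0

            def closure():
                nonlocal batch_loss
                optimizer.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                batch_loss = loss.item()  # loss of the most recent evaluation
                return loss

            optimizer.step(closure)
            epoch_loss += batch_loss
        return epoch_loss / len(batches)
    # Default path: accumulate gradients over all batches, then step once.
    optimizer.zero_grad()
    epoch_loss = 0.0
    for x, y in batches:
        loss = loss_fn(model(x), y) / len(batches)
        loss.backward()
        epoch_loss += loss.item()
    optimizer.step()
    return epoch_loss

torch.manual_seed(0)
x = torch.linspace(-1.0, 1.0, 16).unsqueeze(1)
y = 2.0 * x
batches = [(x[:8], y[:8]), (x[8:], y[8:])]
loss_fn = torch.nn.functional.mse_loss
results = {}
for name in ("sgd", "lbfgs"):
    model = torch.nn.Linear(1, 1)
    if name == "sgd":
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
    else:
        opt = torch.optim.LBFGS(model.parameters(), line_search_fn="strong_wolfe")
    before = loss_fn(model(x), y).item()
    for _ in range(3):
        run_train_epoch(model, opt, batches, loss_fn)
    after = loss_fn(model(x), y).item()
    results[name] = (before, after)
print(results)
```

Keeping the default path unchanged means existing users of Adam or SGD see the same behavior, while closure-based optimizers get a step per batch.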
from neurodiffeq.
I would like to contribute to the project and commit the code myself. Would you accept my pull request?
from neurodiffeq.
Of course! You're welcome to contribute and thanks for your interest.
from neurodiffeq.
Hi Matin, I made a few small changes yesterday in #93 for compatibility.
Can you confirm you are happy with them? If so, I'm going to merge it into master.
from neurodiffeq.