How to use l-bfgs optimization method in neurodiffeq? (neurodiffgym/neurodiffeq)

Comments (20)

shuheng-liu commented on May 28, 2024

It's not possible at the moment, because torch.optim.LBFGS has a different .step() method from most popular optimizers (Adam, SGD, etc.). It requires a closure argument while for others it's optional.

torch.optim.LBFGS can easily be supported by refactoring neurodiffeq.solvers.BaseSolver; we'll add support for it in the near future.

from neurodiffeq.

shuheng-liu commented on May 28, 2024

If you need it right now, you can implement that and contribute to neurodiffeq. We would very much appreciate it.
If you are not sure how to do that, we'll let you know once we add the new feature.


matinmoezzi commented on May 28, 2024

@shuheng-liu Thank you for your reply. I know that torch.optim.LBFGS requires a closure function, so I am trying to add a closure function to the _run_epoch function in BaseSolver in order to support the LBFGS optimization.
In the _run_epoch function, the optimization step is performed after all batches. If I implement the closure function like below, then the optimization step only updates the parameters according to the last loss of the last batch. Shouldn't it update for every training sample?

        # perform optimization step when training
        if key == 'train':
            def closure():
                return loss
            self.optimizer.step(closure)
            # self._do_optimizer_step()
            self.optimizer.zero_grad()


shuheng-liu commented on May 28, 2024

Unfortunately, I'm not familiar with LBFGS either. But according to this post on the PyTorch forum and the documentation, it seems you should include more steps in closure.

Instead of simply returning the loss, you need to include these steps:

1. zero the gradient
2. pass the input (sampled points) to the model and obtain the output (function values)
3. calculate the loss (by calling the differential function)
4. call loss.backward()
5. return the loss tensor

I guess returning the same loss value won't work because LBFGS apparently re-evaluates the loss function several times in every step.
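The steps listed above can be sketched on a toy regression problem. This is only an illustration in plain PyTorch, not neurodiffeq code: the model, the data, and the choice of `line_search_fn="strong_wolfe"` are assumptions made for the example.

```python
import torch

torch.manual_seed(0)

# Toy data (assumption for illustration): fit y = 3x + 1 with one linear layer.
x = torch.linspace(-1.0, 1.0, 50).unsqueeze(1)
y = 3.0 * x + 1.0
model = torch.nn.Linear(1, 1)

# The strong-Wolfe line search is an optional choice made here for stability.
optimizer = torch.optim.LBFGS(model.parameters(), line_search_fn="strong_wolfe")

def closure():
    optimizer.zero_grad()                          # zero the gradient
    pred = model(x)                                # forward pass on the inputs
    loss = torch.nn.functional.mse_loss(pred, y)   # calculate the loss
    loss.backward()                                # backpropagate
    return loss                                    # return the loss tensor

initial_loss = closure().item()
for _ in range(5):
    # LBFGS may call `closure` several times inside a single step.
    optimizer.step(closure)
final_loss = closure().item()
```

Because the closure rebuilds the graph and recomputes the gradients on every call, the optimizer can evaluate the objective at several trial points within one `step`.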

I need more time to look into LBFGS. It's Chinese New Year these days, so there may be some delay. Apologies in advance.


matinmoezzi commented on May 28, 2024

Thank you so much. I have read the LBFGS documentation of PyTorch, but implementing it in neurodiffeq involves some subtle points.
I look forward to hearing from you.
Happy Chinese New Year.


matinmoezzi commented on May 28, 2024

I have implemented two modifications of the _run_epoch function of the BaseSolver class. In the first one, the losses of every batch are stored in a list, and after all batches have run, the optimization step is executed (when the phase is 'train') with a closure function that returns the mean of the losses.
In the second one, the optimization step is performed in each batch if the phase is 'train', and the closure function includes all the steps of computing the loss.
Both of these are implemented to support the LBFGS optimization method.

The first one:

def _run_epoch(self, key):
        r"""Run an epoch on train/valid points, update history, and perform an optimization step if key=='train'.

        :param key: {'train', 'valid'}; phase of the epoch
        :type key: str

        .. note::
            The optimization step is only performed after all batches are run.
        """
        self._phase = key
        epoch_loss = 0.0
        metric_values = {name: 0.0 for name in self.metrics_fn}
        losses = []

        # perform forward pass for all batches: a single graph is created and released in every iteration
        # see https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/17
        for batch_id in range(self.n_batches[key]):
            batch = self._generate_batch(key)
            funcs = [
                self.compute_func_val(n, c, *batch) for n, c in zip(self.nets, self.conditions)
            ]

            for name in self.metrics_fn:
                value = self.metrics_fn[name](*funcs, *batch).item()
                metric_values[name] += value
            residuals = self.diff_eqs(*funcs, *batch)
            residuals = torch.cat(residuals, dim=1)
            loss = self.criterion(residuals) + self.additional_loss(funcs, key)

            # normalize loss across batches
            loss /= self.n_batches[key]
            losses.append(loss)

            # accumulate gradients before the current graph is collected as garbage
            if key == 'train':
                loss.backward()
            epoch_loss += loss.item()

        # calculate mean loss of all batches and register to history
        self._update_history(epoch_loss, 'loss', key)

        # perform optimization step when training
        if key == 'train':
            # def closure():
            #     return torch.stack(losses, dim=0).mean(dim=0)
            # self.optimizer.step(closure)
            self._do_optimizer_step()
            self.optimizer.zero_grad()
        # update lowest_loss and best_net when validating
        else:
            self._update_best()

        # calculate average metrics across batches and register to history
        for name in self.metrics_fn:
            self._update_history(
                metric_values[name] / self.n_batches[key], name, key)

and the second one:

def _run_epoch(self, key):
        r"""Run an epoch on train/valid points, update history, and perform an optimization step if key=='train'.

        :param key: {'train', 'valid'}; phase of the epoch
        :type key: str

        .. note::
            The optimization step is only performed after all batches are run.
        """
        self._phase = key
        epoch_loss = 0.0
        metric_values = {name: 0.0 for name in self.metrics_fn}

        # perform forward pass for all batches: a single graph is created and released in every iteration
        # see https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/17
        for batch_id in range(self.n_batches[key]):
            def closure():
                nonlocal epoch_loss
                batch = self._generate_batch(key)
                funcs = [
                    self.compute_func_val(n, c, *batch) for n, c in zip(self.nets, self.conditions)
                ]

                for name in self.metrics_fn:
                    value = self.metrics_fn[name](*funcs, *batch).item()
                    metric_values[name] += value
                residuals = self.diff_eqs(*funcs, *batch)
                residuals = torch.cat(residuals, dim=1)
                loss = self.criterion(residuals) + \
                    self.additional_loss(funcs, key)

                # normalize loss across batches
                loss /= self.n_batches[key]

                epoch_loss += loss.item()
                # accumulate gradients before the current graph is collected as garbage
                if key == 'train':
                    self.optimizer.zero_grad()
                    loss.backward()
                    return loss
            from torch.optim import LBFGS
            if key == 'train':
                if isinstance(self.optimizer, LBFGS):
                    self.optimizer.step(closure)
                else:
                    closure()
                    self._do_optimizer_step()
            else:
                closure()

        # calculate mean loss of all batches and register to history
        self._update_history(epoch_loss, 'loss', key)

        # perform optimization step when training
        # update lowest_loss and best_net when validating
        if key != 'train':
            self._update_best()

        # calculate average metrics across batches and register to history
        for name in self.metrics_fn:
            self._update_history(
                metric_values[name] / self.n_batches[key], name, key)

I'd appreciate anyone taking the time to comment on these.
Thanks.


shuheng-liu commented on May 28, 2024

Thanks for your ideas! However, there's a subtlety here.

Because training points are randomly sampled every time, there isn't that much difference between an epoch and a batch. We still distinguish between batch and epoch because we can split a large number of samples (which doesn't fit into memory) into multiple batches. We feed each batch to the model and accumulate the gradient. Finally, when all batches have been fed, we perform a single optimization step based on the accumulated gradient. We call that an epoch.

Let's say we want to train on 10k points before performing an optimizer step, but our GPU/CPU memory only allows 5k points at a time. So we split them into two batches of 5k samples each, batch1 and batch2. To show the execution, I expand the for-loop into two consecutive blocks.

# first loop iteration 
u = model(batch1)  # create a graph
loss = pde(u, batch1)  # the graph gets larger, occupying nearly all memory
loss.backward()  # compute gradients of loss w.r.t. model parameters

# second loop iteration 
u = model(batch2)  # create a second graph
loss = pde(u, batch2)  # NOTICE: `loss` from previous iteration is overwritten. The first graph gets garbage-collected, freeing memory
loss.backward() # accumulate gradients of new loss w.r.t. model parameters

# outside the loop
optimizer.step()
optimizer.zero_grad()

Note that, if we keep losses in a list, then each item in losses will require the whole graph to be maintained in memory. The process becomes:

losses = []

# first loop iteration 
u = model(batch1)
loss = pde(u, batch1)
loss.backward()
losses.append(loss)

# second loop iteration 
u = model(batch2)
loss = pde(u, batch2)  # NOTICE: `loss` is overwritten, but `losses` contains the previous loss. No garbage collection will be performed.
loss.backward()
losses.append(loss)

# outside the loop
optimizer.step()
optimizer.zero_grad()

In this case, training on several batches will require the same amount of memory as training on a concatenated batch.
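The accumulate-then-step scheme described in this comment can be checked on a toy model. The sketch below assumes two equally sized batches and a placeholder loss (`toy_loss`, a hypothetical stand-in for the PDE residual loss); it compares the gradient accumulated over the two batches with the gradient computed on the concatenated batch.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(1, 1)
points = torch.rand(10, 1)                # stand-in for the 10k sampled points
batch1, batch2 = points[:5], points[5:]   # two equally sized batches

def toy_loss(batch):
    # Hypothetical placeholder for the residual loss: mean squared output.
    return (model(batch) ** 2).mean()

# Accumulate gradients batch by batch, normalizing by the number of batches.
model.zero_grad()
for batch in (batch1, batch2):
    (toy_loss(batch) / 2).backward()
accumulated = [p.grad.clone() for p in model.parameters()]

# Gradient of the same loss on the concatenated batch, for comparison.
model.zero_grad()
toy_loss(points).backward()
full = [p.grad.clone() for p in model.parameters()]

same = all(torch.allclose(a, f, atol=1e-6) for a, f in zip(accumulated, full))
```

Dividing each batch loss by the number of batches is what makes the accumulated gradient match the gradient of the mean loss over all points.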


shuheng-liu commented on May 28, 2024

The second way seems promising, although I'm not sure how the nonlocal modifier behaves. I'll look into that later.


matinmoezzi commented on May 28, 2024

Are you sure that when the backward function is called, gradients are accumulated?
The nonlocal modifier allows us to modify the epoch_loss variable inside the closure function.
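As a small, self-contained illustration of that `nonlocal` behavior (the function and variable names here are hypothetical and not taken from neurodiffeq):

```python
def run_batches(batch_losses):
    epoch_loss = 0.0

    def closure(value):
        # `nonlocal` rebinds the variable of the enclosing function;
        # without it, `epoch_loss += value` raises UnboundLocalError.
        nonlocal epoch_loss
        epoch_loss += value
        return value

    for value in batch_losses:
        closure(value)
    return epoch_loss

total = run_batches([0.5, 0.25, 0.25])
```

Each call to `closure` updates the same `epoch_loss`, so anything accumulated inside the closure grows with the number of times the optimizer calls it.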


shuheng-liu commented on May 28, 2024

Thanks for explaining. And yes, I'm sure the gradients are accumulated (i.e. summed). This behavior is a consequence of PyTorch's reverse mode of automatic differentiation. Here's a simple example to try:

import torch
w = torch.tensor([1.0], requires_grad=True)
print(w.grad)  # w.grad is None

x1 = torch.tensor([2.0])
x2 = torch.tensor([3.0])

u = w * x1
u.backward()
print(w.grad)  # w.grad is 2.0, new memory space is allocated

v = w * x2
v.backward()
print(w.grad)  # w.grad is 2.0 + 3.0 = 5.0, no extra memory is required


shuheng-liu commented on May 28, 2024

I've implemented the closure feature in a new branch, optimizer-closure, which should support torch.optim.LBFGS now. Can you try installing the library with

# you may have to uninstall your existing installation of neurodiffeq first
pip install git+https://github.com/odegym/neurodiffeq.git@optimizer-closure

and see if it works fine? If so, this will be part of a new feature in the v0.4.0 release.


matinmoezzi commented on May 28, 2024

Thank you for explaining the backward() function.
I tested the optimizer-closure branch on two examples, and for both of them the test-loss and valid-loss approached infinity. If possible, please send me a test example of yours that works correctly.
Also, two points came to mind:
1- Why is epoch_loss aggregated and finally averaged? Shouldn't epoch_loss be the loss of the last run of closure?
2- According to the PyTorch docs about LBFGS, I think the training set should be the same in each execution of the closure function. In this modification, the training set changes (batch_generator) on every call of closure.


shuheng-liu commented on May 28, 2024

I seem to have made a mistake in my tests. The loss gradients are not used for optimization, which is really weird, and I need more time to debug. As for your questions:

1. Epoch loss is the loss of all batches. As I previously explained, we only call it an epoch when all batches are run, so we divide each loss by the number of batches (n_batches[key]) before we add them up. Also, are you certain that the last call to closure is what we need? I don't know why L-BFGS calls closure multiple times, so I assumed that each call serves the same purpose. Since the closure function is called multiple times when using L-BFGS, I average the loss further over the number of times closure is called (closure_run_count).
2. I didn't quite see the part in this link saying that the training set should be the same for the closure function. Can you point out where you found the instructions? I'm not familiar with L-BFGS, and I guess you are probably right. In that case, we'll have to use a different training strategy for L-BFGS. The logic for L-BFGS will be different from other optimizers. Namely, we must run the optimizer step() for every batch instead of running it only once (after all batches are run).


matinmoezzi commented on May 28, 2024

I've read the PyTorch code of the L-BFGS class and figured out that L-BFGS tries to minimize the objective function's value (loss here) in each iteration, so the loss returned from the last call of closure is the real loss of the batch. In fact, adding closure_loss and returning it as the loss in closure is wrong.
According to the following code, the step function is called once in each iteration of the for loop; thus, input and target are the same for every call of closure within one execution of the step function.

for input, target in dataset:
    def closure():
        optimizer.zero_grad()
        output = model(input)
        loss = loss_fn(output, target)
        loss.backward()
        return loss
    optimizer.step(closure)

Finally, I am confident that the step function should be executed in every batch with the same training pair.
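To make the repeated evaluation visible, the following sketch records every loss that the closure computes during a single `step` on one fixed batch. It uses plain PyTorch with a toy linear model; the data and names are assumptions chosen only for illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(1, 1)
x = torch.linspace(-1.0, 1.0, 20).unsqueeze(1)   # one fixed batch
y = 2.0 * x - 1.0
optimizer = torch.optim.LBFGS(model.parameters(), max_iter=20)

losses = []  # one entry per closure call

def closure():
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    losses.append(loss.item())
    return loss

optimizer.step(closure)  # a single step on a single, fixed batch
print(len(losses), losses[0], losses[-1])
```

All entries in `losses` refer to the same `x` and `y`, which is why they can be compared with each other. If the batch were resampled inside the closure, the successive values would come from different objectives.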


matinmoezzi commented on May 28, 2024

I tested the code below on two systems of ODEs and got acceptable results.

def _run_epoch(self, key):
        r"""Run an epoch on train/valid points, update history, and perform an optimization step if key=='train'.

        :param key: {'train', 'valid'}; phase of the epoch
        :type key: str

        .. note::
            The optimization step is performed once per batch, by passing a closure to the optimizer.
        """
        self._phase = key
        epoch_loss = 0.0
        batch_loss = 0.0
        metric_values = {name: 0.0 for name in self.metrics_fn}

        # perform forward pass for all batches: a single graph is created and released in every iteration
        # see https://discuss.pytorch.org/t/why-do-we-need-to-set-the-gradients-manually-to-zero-in-pytorch/4903/17
        for batch_id in range(self.n_batches[key]):
            batch = self._generate_batch(key)

            def closure():
                nonlocal batch_loss
                if key == 'train':
                    self.optimizer.zero_grad()
                funcs = [
                    self.compute_func_val(n, c, *batch) for n, c in zip(self.nets, self.conditions)
                ]

                for name in self.metrics_fn:
                    value = self.metrics_fn[name](*funcs, *batch).item()
                    metric_values[name] += value
                residuals = self.diff_eqs(*funcs, *batch)
                residuals = torch.cat(residuals, dim=1)
                loss = self.criterion(residuals) + self.additional_loss(funcs, key)

                # accumulate gradients before the current graph is collected as garbage
                if key == 'train':
                    loss.backward()
                    batch_loss = loss.item()
                return loss
                
            if key == 'train':
                self._do_optimizer_step(closure=closure)
                epoch_loss += batch_loss
            else:
                epoch_loss += closure().item()

        # calculate mean loss of all batches and register to history
        self._update_history(epoch_loss / self.n_batches[key], 'loss', key)

        # update lowest_loss and best_net when validating
        if key == 'valid':
            self._update_best()

        # calculate average metrics across batches and register to history
        for name in self.metrics_fn:
            self._update_history(
                metric_values[name] / self.n_batches[key], name, key)

    def _do_optimizer_step(self, closure=None):
        r"""Optimization procedures after gradients have been computed. Usually ``self.optimizer.step()`` is sufficient.
        At times, users can overwrite this method to perform gradient clipping, etc. Here is an example::

            import itertools
            from torch import nn

            class MySolver(Solver):
                def _do_optimizer_step(self, closure=None):
                    params = itertools.chain.from_iterable(net.parameters() for net in self.nets)
                    nn.utils.clip_grad_norm_(params, 1.0, 'inf')
                    self.optimizer.step(closure=closure)
        """
        return self.optimizer.step(closure=closure)


shuheng-liu commented on May 28, 2024

Thanks a lot! I'll make a small change so that when using other optimizers (which don't require a closure argument in .step()), training is performed as before; i.e., the parameters are only updated after all batches are run.
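One way to picture such a split is sketched below. This is not the code that was merged into neurodiffeq; `run_training_epoch`, `loss_fn`, and the toy data are hypothetical names introduced only to show the two code paths side by side.

```python
import torch

def run_training_epoch(model, optimizer, batches, loss_fn):
    """Hypothetical helper (not neurodiffeq API): one training epoch."""
    n_batches = len(batches)
    epoch_loss = 0.0

    if isinstance(optimizer, torch.optim.LBFGS):
        # L-BFGS: one step per batch; the closure re-evaluates that same batch.
        for batch in batches:
            last = {}

            def closure():
                optimizer.zero_grad()
                loss = loss_fn(model, batch)
                loss.backward()
                last["loss"] = loss.item()  # keep only the most recent value
                return loss

            optimizer.step(closure)
            epoch_loss += last["loss"] / n_batches
    else:
        # Other optimizers: accumulate gradients, then take a single step.
        optimizer.zero_grad()
        for batch in batches:
            loss = loss_fn(model, batch) / n_batches
            loss.backward()
            epoch_loss += loss.item()
        optimizer.step()
    return epoch_loss

# Toy usage (assumed data and loss, for illustration only).
torch.manual_seed(0)
net = torch.nn.Linear(1, 1)
data = [torch.rand(8, 1) for _ in range(2)]
toy = lambda m, b: ((m(b) - 2.0 * b) ** 2).mean()
first = run_training_epoch(net, torch.optim.Adam(net.parameters(), lr=0.1), data, toy)
second = run_training_epoch(
    net, torch.optim.LBFGS(net.parameters(), line_search_fn="strong_wolfe"), data, toy
)
```

In the L-BFGS branch the recorded loss is the one from the last closure call of each step, which mirrors the `batch_loss` bookkeeping in the code posted above.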


matinmoezzi commented on May 28, 2024

I would like to contribute to the project and commit the code myself. Would you accept my pull request?


shuheng-liu commented on May 28, 2024

Of course! You're welcome to contribute and thanks for your interest.


shuheng-liu commented on May 28, 2024

Hi Matin, I made a few small changes yesterday in #93 for compatibility.
Can you confirm you are happy with it? If so, I'm going to merge it into master.


shuheng-liu commented on May 28, 2024

@matinmoezzi
