cooper-org / cooper Goto Github PK

View Code? Open in Web Editor NEW

98.0 98.0 8.0 1.71 MB

A general-purpose, deep learning-first library for constrained optimization in PyTorch

Home Page: https://cooper.readthedocs.io/

License: MIT License

Makefile 1.06% Python 44.10% Jupyter Notebook 54.84%

constrained-optimization deep-learning machine-learning optimization python pytorch

cooper's People

Contributors

Stargazers

Watchers

Forkers

qiaoptdun sofia-shi daoterog isitarex merajhashemi dsceafit martin-stoehr juanelenter

cooper's Issues

Relax Python requirements to support 3.7

Enhancement

The current package configuration requires Python 38. I believe there is nothing specific keeping us from enabling Python 37. This can be helpful to make Cooper easier to use in Colab which currently supports Python 37.

Constraint schedulers

Enhancement

Enable "schedulers" for the constraints.

Consider a base CMP with objective function $f$ and inequality constraints $g \le \epsilon$. The scheduler could allow the construction of a "moving target" where the constraint is gradually strengthened. One could start with a sequence of optimization problems:
$$ \min_x f(x) \quad \text{s.t.} \quad g(x) \le \epsilon + \psi_t$$
such that the "slack" $\psi_t \rightarrow 0$.

Motivation

Theoretically, this sequence of optimization problems should yield equivalent solutions to that of the base CMP. However, specific (implementations of) algorithms can benefit from relaxing the optimization problem, especially towards the beginning of the optimization.

In the end, we might care about achieving an (approximately) feasible solution of the base CMP that has a good value of the objective function. Thus, there is no need to be overly strict at the beginning of training trying to achieve the "final" constraint level $\epsilon$.

"Curriculum learning" (Bengio et al., 2009) successfully the idea of gradually adjusting the problem difficulty for supervised learning tasks in ML.

Implementation proposal

Pytorch's learning rate schedulers are usually tied to a particular optimizer. For this reason they might not be directly portable for implementing constraint schedulers, but some of their scheduler framework and implementations could be re-used.

References

Bengio, Louradour, Collobert and Weston (2009) Curriculum Learning [link to pdf]

Pytorch LR Scheduler Integration

Context: Many current training pipelines involve tweaking the step size/learning rate throughout training to achieve good performance. Two common scenarios for this are:

Decreasing the learning rate can help to get closer to the optimum, rather than keep bouncing around it.
It can be useful when considering problems in which the theoretical guarantees involved an assumption on diminishing step sizes.

Proposal: Add capabilities to use Pytorch's LR schedulers along with Cooper.

Challenges:

It should be relatively straightforward to handle this for the primal_optimizer as it is "fully instantiated" by the user before creating the ConstrainedOptimizer.
It gets a bit tricky with the dual_optimizer as we don't have access to the full optimizer before the Lagrange multipliers have been initialized.
One important thing here is to keep the definition of the lr scheduler as consistent as possible with that of the dual optimizer: since we ask for a "partially instantiated optimizer" maybe we should ask for a "partially instantiated scheduler"?
Shall we just leave it under the user's control (just as in Pytorch) to perform the calls to lr_scheduler.step()?

Originally posted by @gallego-posada in #16

Easy as possible to use

Enhancement

Make Cooper as easy as possible to integrate with a vanilla Pytorch pipeline.

In particular, allow for the user to populate a CMPState and provide it to Cooper without having to create a custom ConstrainedMinimizationProblem.

Note that we should not remove the CMP altogether. In particular, extrapolation and alternating updates would still require internal calls to cmp.closure and cmp.defect_fn.

This would require a major overhaul of the documentation, tutorials, and examples, showcasing that both the previous CMP approach and this new approach are admissible. Love for the documentation has also been requested in #53 and #29.

Motivation

Users are currently required to implement a custom ConstrainedMinimizationProblem with a closure method.
This overhead can be detrimental to Cooper attracting ML researchers and practitioners who would otherwise compute the loss and constraint violations themselves.

For minimal integrations with Cooper which do not need "fancy features" like Augmented Lagrangian or Extrapolation, simply asking for the loss and constraint defects makes the user's life easier.

Alternatives

It may be possible to remove the CMP completely. Nonetheless, we would still require a closure for an internal call during extrapolation steps and a defect_fc for internal use by the AlternatingConstrainedOptimizer.

Checkpointing

Enhancement

Checkpoint the ConstrainedOptimizer. This entails saving and loading both the primal_optimizer.state_dict() and dual_optimizer.state_dict().

Motivation

This enhancement would facilitate checkpointing for projects which depend on Cooper. Currently, optimizer checkpoints must be handled manually by the user.

References

Pytorch's Saving and Loading a General Checkpoint: https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html

dual_scheduler steps may happen multiple times per epoch

Bug

The call to dual_scheduler.step() happens at the end of ConstrainedOptimizer.step(). This means that dual_scheduler steps happen at the end of every optimization step and not every optimization epoch.

cooper/cooper/constrained_optimizer.py

Lines 235 to 240 in eae7c5a

    
           if self.dual_scheduler is not None: 
        
               # Do a step on the dual scheduler after the actual step on 
        
               # the dual parameters. Intermediate updates that take 
        
               # place inside the extrapolation process do not perform a 
        
               # call to the scheduler's step method 
        
               self.dual_scheduler.step()

This goes against the Pytorch convention on learning rate schedulers.

Expected behavior

Calls to dual_scheduler.step() should happen at the end of every epoch.

How to fix

We could:
(i) Remove the call to dual_scheduler.step() from inside the ConstrainedOptimizer and let the user handle it. The instantiation of the dual scheduler could still be performed internally so that the partial_lr_scheduler logic is preserved (and for the user to not worry about it).
(ii) Indicate whether a step is the last of its epoch and only then call dual_scheduler.step().
(iii) Create a class which wraps both the primal and dual schedulers (like how the ConstrainedOptimizer wraps primal and dual optimizers), whose step() method is manually called by the user at the end of an epoch. The class could internally handle the initialization of the dual scheduler. This approach is similar to (i)

Multiplier Models

Enhancement

Implement a "Multiplier Model" in Cooper. This idea comes from the paper by Narasimhan et al.

Instead of tracking the Lagrange Multipliers of a constrained optimization problem explicitly, consider a model which takes as input features associated with each constraint and outputs their corresponding multiplier.

This requires a new kind of Multiplier in Cooper different from a DenseMultiplier. In turn, we might want to create and abstract class for multipliers outlining how a custom multiplier needs to be implemented.

This also requires a new formulation or the modification of an existing one. Note that the dual_parameters do not match the multipliers in this new functionality. Multipliers are the outputs of the Multiplier Model, whereas the dual_parameters would be the actual weights of said model.

Motivation

Consider problems with an extremely large (or even infinite) number of constraints. It becomes computationally intractable to evaluate and keep track of each constraint. Moreover, when solving the problem with an (explicit) Lagrangian formulation, there is the requirement to store and update an equally large number of multipliers.

With a Multiplier Model, Lagrange multipliers can be computed "on demand", given the constraints that were evaluated at each time step. As long as the Multiplier Model has less parameters than the total possible number of multipliers, there is a gain in terms of storage.

Also, note that the Multiplier Model's parameters are shared across constraints. Therefore, they can learn to have similar multipliers for constraints which tend to be similar to one another.

References

H. Narasimhan, A. Cotter, Y. Zhou, S. Wang, and W. Guo. Approximate Heavily-Constrained Learning with Lagrange Multiplier Models. In NeurIPS, 2020b.

Remove aliases to Pytorch implementations

Enhancement

Remove existing aliases to objects in torch.optim. Update documentation accordingly.

Motivation

Having these aliases makes the interface opaque for the user: it is hard to tell that cooper.optim.SGD is in fact the same implementation as torch.optim.SGD.

Docs example miss-understood

Enhancement

Motivation

Alternatives

References

Document modularized optimizers

Enhancement

This PR #52 made significant changes to the structure of the code for constrained optimizers. However, this change has not been fully integrated into the docs.

Adding a much simpler example

Enhancement

To aid the understanding of this great library one could add a very simple example problem, a proposal is shown below.

Motivation

Currently, the optimization problem in the examples are rather complex. To make using the library more intuitive, one can add a dummy example that is so simple that it can be performed by hand, but that would similarly make the approach easy to understand.

Alternatives

I propose solving the following: minimize the distance to the origin such that it is on the line $y=1-x$. Essentially:

$$\min_{x, y} x^2+ y^2 \qquad \text{s.t.:} \,\, x + y - 1 = 0$$

an example problem discussed in the (Platt and Barr 1987) paper on constrained optimizations and led to the development of another Lagrange Multiplier Optimization library called mdmm. The cooper implementation is:

"""
Pet problem from Constrain Differential Optimization (Plat and Barr 1987).

The paper discusses general problems of the following format.
    min f(x)
        s.t.: g(x)

And discusses among others the example of finding the closest point to the origin such that it is on the line y=1-x.
This can be found by minimizing the following equation:
    min x^2 + y^2
        s.t.: y + x - 1 =0
Which is being solved by this example.

Bram van der Heijden
2023
"""
import matplotlib.pyplot as plt
import numpy as np
import cooper
import torch


class ConstraintOptimizer(cooper.ConstrainedMinimizationProblem):
    def __init__(self, loss, constraint):
        self.loss = loss
        self.constraint = constraint
        super().__init__(is_constrained=True)

    def closure(self, x):
        # Compute the cost function.
        loss = self.loss(x)

        # Compute the violation of the constraint function.
        eq_defect = self.constraint(x)

        return cooper.CMPState(loss=loss, eq_defect=eq_defect)


def f(x):
    """Cost function representing the square distance from the origin"""
    loss = x[0]**2 + x[1]**2
    return loss


def g(x):
    """Constraint function, representing a linear line y=1-x"""
    loss = x[0] + x[1] - 1
    return loss


# Define optimization problem.
cmp = ConstraintOptimizer(f, g)
formulation = cooper.LagrangianFormulation(cmp)

# Define the primal variables and optimizer.
x = torch.tensor([4, 3], dtype=torch.float32, requires_grad=True)
primal_optimizer = cooper.optim.ExtraAdam([x], lr=0.2)

# Define the dual optimizer, not fully instantiated.
dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraAdam, lr=0.05)

# Wrap the optimizer together.
constraint_optimizer = cooper.ConstrainedOptimizer(formulation, primal_optimizer, dual_optimizer)

# Perform the training.
max_itr = 250
itr = np.arange(max_itr)
loss_itr = np.zeros(max_itr)
cons_itr = np.zeros(max_itr)
for i in range(max_itr):
    # Print iteration updates.
    loss_itr[i] = float(f(x))
    cons_itr[i] = float(g(x))
    print(f"Itr: {i:7,d}, loss {loss_itr[i]:4g}, conts {cons_itr[i]:4g}")

    # Perform actual minimization.
    constraint_optimizer.zero_grad()
    lagrangian = formulation.composite_objective(cmp.closure, x)
    formulation.custom_backward(lagrangian)
    constraint_optimizer.step(cmp.closure, x)

# Exact solution is at (0.5, 0.5)
print("")
print("Result is: ", x.detach().numpy())
print("Error to exact is: ", float(torch.norm(x-torch.tensor([0.5, 0.5]))))

# Plot the convergence graph.
fig, ax = plt.subplots(1, 2)
ax[0].set_ylabel(f"Cost")
ax[0].set_xlabel(f"Iteration number")
ax[0].plot(itr, loss_itr)
ax[1].set_ylabel(f"Constraint")
ax[1].set_xlabel(f"Iteration number")
ax[1].plot(itr, cons_itr)
plt.show()

References

J. C. Platt en A. H. Barr, ‘Constrained differential optimization’, in Proceedings of the 1987 international conference on neural information processing systems, in NIPS’87. Cambridge, MA, USA: MIT Press, 1987, pp. 612-621.
K. Crowson (crowsonkb), ‘mdmm’, available on GitHub

Fix bibtex for Cooper

Bug

When using the bibtex in the README: "How to cite this work?", the name of Cooper does not render the correct capital letters.

The bibtex entry in the README requires an edit on its curly braces in order to avoid this.

Restart dual_optimizer state when performing dual restarts

Enhancement

When a dual restart is triggered, the dual variables are reset to their initial value 0.
Nonetheless, the state of the primal and dual optimizer remains the same. This may include running averages for momentum mechanisms.

These could be reset along with dual restarts when feasibility is achieved.

Motivation

This would represent a full reset of the optimization protocol when the constraint is being satisfied. Currently, the reset is "half baked" in the sense that only dual variables are reset.

References

Reset the state of a Pytorch optimizer: https://discuss.pytorch.org/t/reset-adaptive-optimizer-state/14654/5

Modularize ConstrainedOptimizer

Enhancement

Modularize the constrained_optimizer.py script to have a BaseConstrainedOptimizerclass from which specific optimization approaches can stem.

For instance, a SimultaneousOptimizer, AlternatingOptimizer, ExtrapolationOptimizer, and in the future perhaps an ExtrapolationFromThePastOptimizer.

Motivation

Current implementation of ConstrainedOptimizer is very flexible: it allows for unconstrained optimization, constrained optimization with simultaneous and alternating updates and extragradient updates.

Because of this, the logic inside the ConstrainedOptimizer class has become overly complex.

cooper/cooper/constrained_optimizer.py

Lines 268 to 270 in f1d9554

    
           # TODO (JGP): The logic inside this method is becoming overly complex 
        
           # due to the constant friction between extrapolation, alternating 
        
           # updates, and proxy-constraints. We might want to consider refactoring.

A more modularized implementation would make the code easier to understand and less error prone. The added flexibility would enable implementing new optimization approaches, like extrapolation from the past [1].

A similar approach was taken with the Formulations, where a BaseLagrangianFormulation underpins the LagrangianFormulation and ProxyLagrangianFormulation.

References

Gidel, G., Berard, H., Vignoud, G., Vincent, P., & Lacoste-Julien, S. (2019). A variational inequality perspective on generative adversarial networks. In ICLR.

how to use cuda acceleration

Hi, I was wondering if cooper can use cuda to accelerate the computation. (I thought it could, but there is no place to claim something like probs.cuda() )

Add documentation for checkpointing

Enhancement

Checkpointing capacities were enabled in #27, but the documentation and a minimal example have not been written.

Typo in the documentation for Lagrangian Formulation

Bug

There is a minus sign on the documentation of Lagrangian formulations which should not be there.

The Lagrange multipliers aim to maximize the Lagrangian (Boyd, $5.1). Therefore,

The current file says that the multiplier maximizes -L instead.

cooper/docs/source/lagrangian_formulation.rst

Line 42 in de1c6fe

:math:`-\mathcal{L}(x,\lambda)` (or equiv. minimize

Steps

The minus sign should be erased, so that the explanation is mathematically correct.

Expected behavior

Correct documentation.
No changes required in the source code.

Environment

No environment needed.

Context

Error on documentation

Do `maximize=True` for dual_optimizers

Enhancement

Pytorch optimizers include a maximize flag (Pytorch Issue)*. When set to True, the sign of gradients is flipped inside optimizer.step() before computing parameter updates. This enables gradient ascent steps natively.

NOTE: This sign flip does not affect the parameter's .grad attribute.

Cooper currently populates the gradients of dual variables with their negative value, so that descent steps performed by the dual optimizer are in fact ascent steps towards optimizing the problem formulation.

cooper/cooper/constrained_optimizer.py

Lines 401 to 406 in 09df759

    
           # Flip gradients for multipliers to perform ascent. 
        
           # We only do the flipping *right before* applying the optimizer step to 
        
           # avoid accidental double sign flips. 
        
           for multiplier in self.formulation.state(): 
        
               if multiplier is not None: 
        
                   multiplier.grad.mul_(-1.0)

We should not do the sign flipping manually but rather force set maximize=True when instantiating the dual optimizer.

* This has been implemented on on Pytorch's master branch for every optimizer but LBFGS . On v1.12, Adam, SGD and AdaGrad support the flag, but not RMSProp. An assert could be included to ensure that the requested dual optimizer supports the flag.

⚠ This change would break compatibility with versions of Pytorch prior to 1.12.

Motivation

Manually flipping gradients immediately after calculating them (thus ensuring that this happens before calls to dual_optimizer.step()) is error prone.
Moreover, keeping track of the fact that gradients have a sign flipped is inconvenient.

By implementing this change we would adopt the official Pytorch approach for performing ascent steps.

Alternatives

The current implementation is functional.

References

pytorch/pytorch#68052

Cooper level wrappers for `formulation.custom_backward` and `formulation.composite_objective`

Enhancement

Create wrappers cooper.backward for formulation.custom_backward and cooper.compute_lagrangian for formulation.composite_objective.

This would change the way a user interacts with Cooper. However, it does not represent a change in Cooper's back-end.

Motivation

A cleaner pipeline for the user, where there is no need to access the methods of a formulation:

for step in range(num_steps):
    ...
    constrained_optimizer.zero_grad()
    lagrangian = cooper.compute_lagrangian(formulation, ...)
    cooper.backward(formulation, lagrangian)
    constrained_optimizer.step()

As opposed to the current pipeline:

for step in range(num_steps):
    ...
    constrained_optimizer.zero_grad()
    lagrangian = formulation.composite_objective(...)
    formulation.custom_backward(lagrangian)
    constrained_optimizer.step()

Add test for alternating updates

Enhance Testing

Alternating updates are currently untested. Should add a test for this.

Augmented Lagrangian Method

Enhancement

Implement the augmented Lagrangian method (ALM; aka "method of multipliers").

Motivation

The ALM (approximately) solves a sequence of unconstrained minimization problems on the primal variables, and updates the multipliers at the end of each minimization. Explicit Lagrange multiplier estimates help avoid the ill-conditioning that is inherent in the quadratic penalty function.

Consider a CMP with objective function $f$ and equality constraints $h=0$.
Given a positive penalty parameter sequence ${c^{k}}$, the ALM solves
$$ x^k = \arg \min_x L_{c^k}(x, \lambda^k) \triangleq f(x) + {\lambda^k}^{\top} h(x) + \frac{c^k}{2} ||h(x)||^2$$
with respect to $x$, and updates the Lagrange multipliers as:
$$ \lambda^{k+1} = \lambda^{k} + c^k h(x^k).$$

The main advantage of the ALM compared to the quadratic penalty method is that (under some reasonable assumptions), the algorithm can be successful without increasing the penalty parameter $c^k \rightarrow \infty$.

Bertsekas (1999) Section 4.2.1 suggests a way to handle inequality constraints.

Alternatives

Tseng and Bertsekas (1993) propose the "exponential multiplier method" (EMM), where an exponential, rather than a quadratic penalty is used. The resulting Lagrangian in EMM is twice differentiable if the objective and constraint functions are.
However, 2nd-order methods are challenging to scale to modern ML problems. Implementing EMM could be considered as an addition to the "standard" ALM, rather than a replacement.

References

Nocedal and Wright - Numerical Optimization - Section17.3
Bertsekas - 1999 - Nonlinear Optimization - Section 4.2.2
Tseng and Bertsekas - 1993 - On the convergence of the exponential multiplier method for convex programming

Multiple primal optimizers

Enhancement

Enable the user to provide several "primal optimizers".
Handle the set of optimizers as "one" optimizer. All operations currently applied to constrained_optimizer.primal_optimizer to be applied simultaneously across the set of optimizers.

See Pytorch Lightning: Optimization for an example where this is implemented.

Motivation

Consider optimization problems where different groups of variables are optimized with different approaches (optimizer classes).
In addition, consider setting different optimization hyper-parameters for each group of variables.
This is currently not supported by Cooper, as one primal optimizer is used to update all primal variables.

In the paper Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints, the authors employ different learning rates for the "weights" and "gates" parameters of their models.

They implement this in a hacky way using Cooper: by defining one primal optimizer and setting different parameter_groups to have different learning rates. If they wanted to use different optimizer classes for weights and gates, this approach would not work.

References

Falcon, W., & The PyTorch Lightning team. (2019). PyTorch Lightning (Version 1.4) [Computer software]. https://doi.org/10.5281/zenodo.3828935
Gallego-Posada, J., Ramirez, J., Erraqabi, A., Bengio, Y., & Lacoste-Julien, S. (2022). Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints. arXiv preprint arXiv:2208.04425.

Code coverage badge

Enhancement

Finalize the code coverage addition.

Motivation

Code coverage action and badge are part of the Cooper codebase, but they are not fully functional yet.

Example for lagrangian constraints

The documentation of this library is very limited and it is hard to parse how it should be applied for different applications. For instance, in Many RL algorithms such as PPO and MPO, an entropy regularization terms or the KL constraints between policies employed to stabilise standard RL objectives. Is it possible to give a practical example of how one can use this library for a dual problem with lagrangian multipliers for such applications?

Many thanks.

Deprecated `StateLogger`

Bug

The StateLogger in cooper.utils.state_logger.py is not compatible with the latest version of Cooper's dev branch.

It assumes that a Formulation will have a ConstrainedMinimizationProblem as an attribute:

cooper/cooper/utils/state_logger.py

Lines 42 to 48 in 3e0aa6f

    
           for metric in self.save_metrics: 
        
               if metric == "loss": 
        
                   aux_dict[metric] = formulation.cmp.state.loss.item() 
        
               elif metric == "ineq_defect": 
        
                   aux_dict[metric] = deepcopy(formulation.cmp.state.ineq_defect.data) 
        
               elif metric == "eq_defect": 
        
                   aux_dict[metric] = deepcopy(formulation.cmp.state.eq_defect.data)

This is not the case since merging #56.

Steps

Update the interface of StateLogger. store_metrics should receive a CMPState or the unpacked metrics.

Multiple dual optimizers

Enhancement

Enable the user to provide several partially instantiated "dual optimizers". These could be handled "as a single" optimizer by applying all operations currently applied to constrained_optimizer.dual_optimizer simultaneously across the set of optimizers.

See #45 for a PR which enables support for multiple primal optimizers.

Note that dual optimizers are currently instantiated internally by Cooper after the instantiation of dual variables by the formulation. For a LagrangianFormulation, this happens after a first call to cmp.closure.

Therefore, too much granularity or flexibility for the the grouping of dual variables under a shared optimizer is trickier than in #45.
If restricted to groupings of constraints which are easily derived from the formulation/cmp (like inequality and equality constraints), this feature should be relatively easy.

Motivation

Setting up different optimizer classes or hyperparameters across (groups of) constraints. For instance, one for equality constraints and another for inequality constraints.

Original Idea

A natural follow-up feature for [#45] would be [to allow for] multiple (partially instantiated) dual optimizers.

It might be tricky to allow for too much granularity (say a different optimizer for each constraint). But allowing for different optimizers for equality and inequality constraints could be a good first step.

However, this does not seem to be a pressing feature to implement at the moment.

Originally posted by @gallego-posada in #45 (comment)

Extrapolation from the past

Enhancement

Implement the extrapolation from the past algorithm (Popov, 1980). A good and modern source is Gidel et al. (2019).

This is an algorithm for computing parameter updates similar to extragradient: it computes the direction for updating parameters based on a "lookahead step". It is less intensive computationally than extragradient and enjoys similar convergence results for some class problems (Gidel et al., 2019).

Motivation

Whereas extragradient requires two gradient computations per parameter update, extrapolation from the past stores and re-uses gradients from previous extrapolation steps for use during current extrapolation steps. This means less computational intensity in terms of gradient calculations, which may be helpful in some settings.

However, storing the previous gradients still means an overhead in terms of storage as opposed to gradient descent-ascent.

References

G. Gidel, H. Berard, G. Vignoud, P. Vincent, S. Lacoste-Julien. A Variational Inequality Perspective on Generative Adversarial Networks. In ICLR, 2019.
L. D. Popov. A modification of the arrow-hurwicz method for search of saddle points. Mathematical
notes of the Academy of Sciences of the USSR, 1980.

bad path for install instructions

(first issue, yay!!!)

I get an error if I try to install cooper following the readme

I suggest changing the pip install command for this one that works fine for me:

pip install git+https://github.com/cooper-org/cooper.git

Loading state_dict of dual_schedulers

Bug

The instantiation of the dual_scheduler inside ConstrainedOptimizer.load_from_state_dict method has a bug. The variable containing the dual scheduler class is called dual_scheduler_class as opposed to dual_scheduler.

cooper/cooper/constrained_optimizer.py

Line 491 in f75dff3

dual_scheduler = dual_scheduler(dual_optimizer)

Steps

Replace current line to dual_scheduler = dual_scheduler_class(dual_optimizer)

Expected behavior

This should enable the loading of a dual_scheduler from a checkpoint.

A tutorial example for the generation of time series data

Enhancement

A tutorial example for the generation of time series data w.r.t. to constrains on the values.

An open question to me is if the inequalities below should result in a scalar ineq_defect tensor, i.e., so I need to sum up the defects per time step, or if it is fine the way it is.

Motivation

I want to use cooper to create experiments, e.g. for system identification. In order to ensure that the commanded inputs do not destroy the device, I want the resulting trajectories to obey some device-dependent constraints. I am interested how non-differentiable constraints can be added in this context (yes, I saw the classification demo but am still a bit confused).

Alternatives

There are for sure plenty, since I just started playing with cooper 1h ago :)

Suggested Starting Point

I know this code is unpolished, but I first wanted to know what you think of this idea before I put work into it.

from typing import Callable

import cooper
import matplotlib.pyplot as plt
import torch


def generate_traj(time: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """ Helper function to generate a 5th order polynomial. """
    return (
        coeffs[0]
        + coeffs[1] * time
        + coeffs[2] * time ** 2
        + coeffs[3] * time ** 3
        + coeffs[4] * time ** 4
        + coeffs[5] * time ** 5
    )


class MinDeviation(cooper.ConstrainedMinimizationProblem):
    """
    Generate a trajectory which has the minimal L2 distance from a reference trajectory, which satisfying
    an (arbitrarily chosen) inequality constraint on its values.
    """
    
    def __init__(
        self,
        traj_ref: torch.Tensor,
        time_ref: torch.Tensor,
        traj_gen_fcn: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
    ):
        # I know this class is far from elegant, but let's keep it for now.

        super().__init__(is_constrained=True)
        self.traj_ref = traj_ref.detach().clone()
        self.time_ref = time_ref.detach().clone()
        self.traj_gen_fcn = traj_gen_fcn
        self.loss_fcn = torch.nn.MSELoss()

    def closure(self, coeffs: torch.Tensor) -> cooper.CMPState:
        # Generate the trajectory which we want to constrain.
        traj = self.traj_gen_fcn(self.time_ref, coeffs)

        # Evaluate the cost function.
        loss = self.loss_fcn(self.traj_ref, traj)

        # Evaluate the equality constraints g(x) = 0, i.e., the violation thereof.
        eq_defect = None

        # Evaluate the inequality constraints h(x) <= 0, i.e., the violation thereof.
        ineq_defect = traj - torch.tensor([7.0])  # all values should be less or equal than 7

        return cooper.CMPState(loss, ineq_defect, eq_defect)


if __name__ == "__main__":
    # Configure the experimental design.
    time = torch.linspace(0, 2, 51)
    coeffs = torch.tensor([4, -3, 3, 2, 0.5, -1])
    fig, axs = plt.subplots(1, 1, figsize=(12, 8))

    traj_unconstr = generate_traj(time, coeffs)
    axs.plot(time.numpy(), traj_unconstr.numpy())

    # Define the problem and its Lagrangian formulation.
    cmp = MinDeviation(traj_ref=traj_unconstr, time_ref=time, traj_gen_fcn=generate_traj)
    formulation = cooper.LagrangianFormulation(cmp)

    # Define the primal parameters and optimizer.
    coeffs_param = torch.nn.Parameter(coeffs.clone())  # start with the param values of the unconstrained traj
    primal_optimizer = cooper.optim.ExtraSGD([coeffs_param], lr=3e-3, momentum=0.7)

    # Define the dual optimizer. Note that this optimizer has NOT been fully instantiated
    # yet. Cooper takes care of this, once it has initialized the formulation state.
    dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraSGD, lr=9e-4, momentum=0.7)

    # Wrap the formulation and both optimizers inside a ConstrainedOptimizer
    coop = cooper.ConstrainedOptimizer(formulation, primal_optimizer, dual_optimizer)

    # Here is the actual training loop.
    # The steps follow closely the `loss -> backward -> step` Pytorch workflow.
    for iter in range(3000):
        coop.zero_grad()
        lagrangian = formulation.composite_objective(cmp.closure, coeffs_param)
        formulation.custom_backward(lagrangian)
        coop.step(cmp.closure, coeffs_param)

        assert not torch.any(torch.isnan(coeffs_param)), f"Observed the first NaN value at iter {iter}!"

    # Get trajectory with the final coefficients.
    with torch.no_grad():
        traj_constr = generate_traj(time, coeffs_param)

    axs.plot(time.numpy(), traj_constr.numpy(), ls="--")
    plt.show()

Question: how / when to use proxy inequalities?

Even after reading the classification tutorial, I did not fully understand when and how to use the proxy_ineq_defect.

In my case, I am running a dynamics simulation given a set of inputs that result from a parameterized module.
These parameters should be optimized such that the observations made from the (differentiable) sim suffice some criteria.
These criteria could be that at all time steps the 2nd dim of the obs should be < 0.5 or that 90% of the values of the 3rd obs dim should be > 0.8. Let's say these criteria can be arbitrary.

What works so far is, that I have a function that computes the ineq_defect passed to the CMP's state.
Using plain inequalities, I can optimize the input generator such that these criteria are met.
During this process, I run exactly one forward sim of my deterministic system.

Now, I understood your tutorial in a way that I thought if I want to have a non-differentiable criterion (e.g., the 90% one form above) I need to use the proxy_ineq_defect.
However, when I use the same function as for the ineq_defect (while removing that one from the CMP state), I get this error:
AttributeError: 'NoneType' object has no attribute 'mul_'. I originates from the dual_step method.
Any hints what I am missing?

P.S: When opening this issue, I could only choose between bug and enhancement which both don't apply here.

Provide more "real-life" example in README

Enhancement

Modify README to include a more comprehensive example of how to use Cooper for machine learning problems. Some features to cover include:

Integration with torch.nn.Module
Using CUDA acceleration
Data loaders and mini-batching

We could to have two versions:

One with "off-the-shelf Cooper": no explicit user-defined closure
One with explicit closure definition, which enables the use of advanced features like extrapolation optimizers

Maybe only include the first one in the README, and keep the second version for the docs.

Non comprehensive `init` files

Enhancement

__init__ files throughout Cooper are not fully consistent.

For instance, Cooper.__init__ imports UnconstrainedFormulation and LagrangianFormulation , but not AugmentedLagrangianFormulation.

cooper/cooper/__init__.py

Lines 18 to 22 in a8b3400

    
           from cooper.formulation import ( 
        
               Formulation, 
        
               LagrangianFormulation, 
        
               UnconstrainedFormulation, 
        
           )

For consistency, we should import the AugmentedLagrangianFormulation as well.

Note: Creating an issue as there may be other instances of this throughout the library.

RuntimeError: Trying to backward through the graph a second time

Bug

Hello, i have discovered your library recently and i am trying to use it for my research. While trying to solve a constrained minimization problem, i encountered an error.
Here is the full error stack in a simplified case

Traceback (most recent call last):
  File "/home/david.danan/psinns/Tests/ReportIssue.py", line 137, in <module>
    EnergySolver(nodes=gridNodes,
  File "/home/david.danan/psinns/Tests/ReportIssue.py", line 112, in EnergySolver
    coop.step(loss.closure,predictedField,positions)
  File "/opt/miniconda/envs/FEMINNSEnv/lib/python3.9/site-packages/cooper/constrained_optimizer.py", line 222, in step
    lagrangian = self.formulation.composite_objective(
  File "/opt/miniconda/envs/FEMINNSEnv/lib/python3.9/site-packages/cooper/lagrangian_formulation.py", line 232, in composite_objective
    cmp_state = closure(*closure_args, **closure_kwargs)
  File "/home/david.danan/psinns/Tests/ReportIssue.py", line 84, in closure
    duxdxyz = grad(displacement[:, 0].unsqueeze(1), position, torch.ones(position.size()[0], 1, device=device), create_graph=True, retain_graph=True)[0]
  File "/opt/miniconda/envs/FEMINNSEnv/lib/python3.9/site-packages/torch/autograd/__init__.py", line 300, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.

As far as i can understand, the case where the loss function requires an explicit call to autograd to compute its value is not handled for now.

I had tried to redefine the behaviour of cooper.LagrangianFormulation by inheriting from it and adding an option to the backward call

class LagrangianFormulationConserveGraph(cooper.LagrangianFormulation):
    @no_type_check
    def _populate_gradients(
        self, lagrangian: torch.Tensor, ignore_primal: bool = False
    ):

        if ignore_primal and self.cmp.is_constrained:
            pass
        else:
            lagrangian.backward(retain_graph=True)

        if self.cmp.is_constrained:
            for violation_for_update in self.state_update:
                dual_vars = [_ for _ in self.state() if _ is not None]
                violation_for_update.backward(inputs=dual_vars)

The error did disappear but the optimization was stucked (neither the loss nor the constraints changes throught the iterations). Therefore, i guess it was not the correct way to do it.

Steps

The full script is enclosed below and should be enough to reproduce this bug.

Expected behavior

Be able to run the script and seing actual change throught the iteration process regarding the loss and constraints

Environment

For cooper, i followed the instruction on the Readme (pip)
cooper: 0.1.dev8+geae7c5a
All the other dependencies where installed within a conda environment:
-python=3.9.16
-numpy=1.22.3
-torch: 1.13.1

Context

Since the whole script has some external dependencies not related to this issue, i share below the problematic part mentionned in the error stack in a simplified shorter version.

In a nutshell, here is what i am trying to do:
-i had an input field that i know (input) and another that i don't (output)
-both fields live on a grid ( 3 values per nodes in both cases)
-Using a neural network, i would like to describe the relationship between these 2 fields
-The loss function denotes a property of the output field, the optimal solution should minimize this quantity
-The constraints enforce the value of the field on a specific part of the grid (here, 0).

The only explicit dependencies within the script are numpy, cooper and torch.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np

import torch
from torch.autograd import grad
import torch.nn as nn

import cooper
from cooper.problem import CMPState, Formulation

device = torch.device('cpu')
if torch.cuda.is_available():
    print("CUDA is available, running on GPU")
    device = torch.device('cuda')
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
else:
    print("CUDA not available, running on CPU")

class TorchStraightForwardBuilder():
    def __init__(self,
                 operations):
        self.operations=operations

    def BuildModel(self):
        seqOperations=self.BuildSeqOperationsFromBlockOperations()
        return TorchClassicalStraightForwardNetwork(operations=seqOperations)

    def BuildSeqOperationsFromBlockOperations(self):
        seqOperations=[self.BuildNNComponent(nnComponent) for nnComponent in self.operations]
        return seqOperations

    def BuildNNComponent(self,nnComponent:tuple):
        brickName,params=nnComponent
        if params is None:
            return CreateTorchComponent(brickName)
        return CreateTorchComponent(brickName,params)


class TorchClassicalStraightForwardNetwork(nn.Module):
    def __init__(self,operations:list):
        super(TorchClassicalStraightForwardNetwork, self).__init__()
        self.allComponents=operations
        self.torchComponents = nn.ModuleList([op for op in operations if isinstance(op,nn.Module)])

    def forward(self, x):
        inputVal=x
        for component in self.allComponents:
            output=component(inputVal)
            inputVal=output
        return output

def CreateTorchComponent(name,ops=None):
    if ops==None:
        if name=="ReLU":
            return nn.ReLU()
        else:
            raise Exception("Activation function not recognized")
    else:
        if name=="Linear":
            return nn.Linear(**ops)
        else:
            raise Exception("Layer not recognized")


def trapz(y,dx,axis=-1):
    nd = y.ndimension()
    slice1 = [slice(None)] * nd
    slice2 = [slice(None)] * nd
    slice1[axis] = slice(1, None)
    slice2[axis] = slice(None, -1)
    integralValue = torch.sum(dx * (y[tuple(slice1)] + y[tuple(slice2)]) / 2.0, axis)
    return integralValue

class MinimizationEnergyProblem(cooper.ConstrainedMinimizationProblem):
    def __init__(self,
                 constraintCondition,
                 integrationDomain):
        super().__init__(is_constrained=True)
        self.constraintIndices,self.constraintValues=constraintCondition
        self.integrationDomain=integrationDomain

    def closure(self, displacement, position):
        duxdxyz = grad(displacement[:, 0].unsqueeze(1), position, torch.ones(position.size()[0], 1, device=device), create_graph=True, retain_graph=True)[0]
        internalEnergy = duxdxyz[:, 0].unsqueeze(1) + 1
        dx,dy,dz=self.integrationDomain["spacing"]
        nx,ny,nz=self.integrationDomain["nbNodes"]
        integrand = internalEnergy.reshape(nx,ny,nz)
        internalEnergyIntegrated=trapz(trapz(trapz(integrand, dx=dz), dx=dy), dx=dx)
        eq_defect=torch.flatten(displacement[self.constraintIndices] - self.constraintValues)
        return cooper.CMPState(loss=internalEnergyIntegrated, eq_defect=eq_defect)

def EnergySolver(nodes,model,constraintCondition,integrationDomain):
    modelFC=TorchStraightForwardBuilder(model)
    modelOutput=modelFC.BuildModel().to(device)
    loss = MinimizationEnergyProblem(constraintCondition=constraintCondition,
                              integrationDomain=integrationDomain)

    formulation = cooper.LagrangianFormulation(loss)#LagrangianFormulationConserveGraph(loss)
    positions=torch.from_numpy(nodes).to(device).requires_grad_(True)
    predictedField = modelOutput(positions)

    primal_optimizer = cooper.optim.ExtraSGD(modelOutput.parameters(), lr=3e-3, momentum=0.5)
    dual_optimizer = cooper.optim.partial_optimizer(cooper.optim.ExtraSGD, lr=9e-3, momentum=0.5)

    coop = cooper.ConstrainedOptimizer(formulation, primal_optimizer, dual_optimizer)
    state_history = cooper.StateLogger(save_metrics=["loss", "eq_defect", "eq_multipliers"])
    for iter_num in range(10):
        coop.zero_grad()
        lagrangian = formulation.composite_objective(loss.closure,predictedField,positions)
        formulation.custom_backward(lagrangian)
        coop.step(loss.closure,predictedField,positions)

if __name__ == '__main__':
    model=[
        ("Linear",{"in_features":3, "out_features":50}),
        ("ReLU",None),
        ("Linear",{"in_features":50, "out_features":50}),
        ("ReLU",None),
        ("Linear",{"in_features":50, "out_features":3})
          ]

    nNodesX,nNodesY,nNodesZ=10,20,30
    x = np.linspace(0, 1, nNodesX).astype(np.float32)
    y = np.linspace(0, 1, nNodesY).astype(np.float32)
    z = np.linspace(0, 1, nNodesZ).astype(np.float32)
    x,y,z=np.meshgrid(x,y,z)
    x,y,z=x.flatten(),y.flatten(),z.flatten()
    gridNodes=np.vstack((x,y,z)).transpose()

    constraintIndices=np.where(x==0)[0]
    constraintCondition=np.where(x==0)[0],torch.zeros((len(constraintIndices),3))
    
    hSpacings=1/(nNodesX-1),1/(nNodesY-1),1/(nNodesZ-1)
    integrationDomain={"spacing":hSpacings,"nbNodes":(nNodesX,nNodesY,nNodesZ)}

    EnergySolver(nodes=gridNodes,
                 model=model,
                 constraintCondition=constraintCondition,
                 integrationDomain=integrationDomain)

Remove code repetition in tests

Enhancement

The current code for tests includes substantial code repetition. A more systematic use of fixtures can help reduce this code redundancy.

	if self.dual_scheduler is not None:
	# Do a step on the dual scheduler after the actual step on
	# the dual parameters. Intermediate updates that take
	# place inside the extrapolation process do not perform a
	# call to the scheduler's step method
	self.dual_scheduler.step()

	# TODO (JGP): The logic inside this method is becoming overly complex
	# due to the constant friction between extrapolation, alternating
	# updates, and proxy-constraints. We might want to consider refactoring.

	# Flip gradients for multipliers to perform ascent.
	# We only do the flipping right before applying the optimizer step to
	# avoid accidental double sign flips.
	for multiplier in self.formulation.state():
	if multiplier is not None:
	multiplier.grad.mul_(-1.0)

	for metric in self.save_metrics:
	if metric == "loss":
	aux_dict[metric] = formulation.cmp.state.loss.item()
	elif metric == "ineq_defect":
	aux_dict[metric] = deepcopy(formulation.cmp.state.ineq_defect.data)
	elif metric == "eq_defect":
	aux_dict[metric] = deepcopy(formulation.cmp.state.eq_defect.data)

	from cooper.formulation import (
	Formulation,
	LagrangianFormulation,
	UnconstrainedFormulation,
	)