davda54 / sam Goto Github PK
View Code? Open in Web Editor NEWSAM: Sharpness-Aware Minimization (PyTorch)
License: MIT License
SAM: Sharpness-Aware Minimization (PyTorch)
License: MIT License
Hi
Thank you for your great Pytorch implementation! I'm wondering if there is a plan to support multi devices training (with data parallelism). As you might know, SAM benefits a lot from computing a different epsilon for each replica (section 4.1 of the paper on the m-sharpness), and the best generalization is often obtained when training on multiple GPUs or TPUs. It would be an awesome addition to this codebase.
@davda54 Thank you for providing the implementation of pytorch. The following is about the implementation of gradient accumulation. I am not sure whether the code is correct or not, so I hope to get your help~
Hi, is there anyone who has reproduced the results reported by the paper on ImageNet with ResNet-18 or ResNet-50?
Hi,
I am currently running some experiments with SAM. Previously I used Jax/Flax code provided by the authors of SAM. However for some reasons, I also have to run experiments with pytorch version. Then I found your pytorch implementation of SAM/ASAM.
First of all thank you for your work, it saved a lot of work and time for me. Especially listing known issues in Readme was very helpful.
I realized that Multi GPU version of SAM is yet to be implemented (and also not planned to be implemented in near future according to your comments in a closed issue). So I have following questions regarding Multi GPU settings.
Once again thank you for the great work!
Hi, nice to meet you.
Your SAM is exciting paper. Very good job! Thank you!
I want to use SAM optimizer with pytorch lightning. But I had an error, and I can not find any information about it.
Could you kindly help me?
Here is my code:
class my_class(LightningModule):
def __init__(self):
self.SAM = True
if self.SAM:
self.automatic_optimization = False
def training_step(self, batch, batch_idx):
loss = self.shared_step(batch)
if self.SAM:
optimizer = self.optimizers()
# first forward-backward pass
self.manual_backward(loss, optimizer)
optimizer.first_step(zero_grad=True)
# second forward-backward pass
loss_2 = self.shared_step(batch)
self.manual_backward(loss_2, optimizer)
optimizer.second_step(zero_grad=True)
return loss
def configure_optimizers(self):
if self.SAM:
base_optimizer = torch.optim.SGD
optimizer = SAM(self.parameters(), base_optimizer, lr=0.01)
return optimizer
Hello,
Thanks for the nice implementation! I have a question about the Cutout augmentation in the code below.
class Cutout:
def __init__(self, size=16, p=0.5):
self.size = size
self.half_size = size // 2
self.p = p
def __call__(self, image):
if torch.rand([1]).item() > self.p: return image
left = torch.randint(-self.half_size, image.shape[0] - self.half_size, [1]).item()
top = torch.randint(-self.half_size, image.shape[1] - self.half_size, [1]).item()
right = min(image.shape[0], left + self.size)
bottom = min(image.shape[1], top + self.size)
image[max(0,left):right, max(0,top):bottom, :] = 0
return image
It seems like here you are assuming that the shape of image
is [Height, Width, Channel]
, but actually here the input image
is in [Channel, Height, Width]
. I think the Cutout function need to be modified so that it assumes image
to be the shape as [Channel, Height, Width]
. Is my understanding correct here, or maybe I'm missing anything? Thanks!
could you add a simple cifar replicate-able example for pytorch lightning please?
Hello! I've been getting very good results using SAM. I stumbled on this paper, ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks, from Samsung Research improving upon SAM. ASAM is SAM but scale invariant, so it minimizes a relative neighborhood of weights vs. a perfect circle of weights and further improves generalization. Would you like to help me implement it?
They add an operator, T_w, described on page 8 in their paper, into the epsilon equation in a couple places. I am using their element-wise T_w. They show their epsilon equation on page 11 under where it says "for t = 0, 1, 2, · · · . Especially, if p = 2,", multiplying the grads by T_w before the norm calc, and multiplying the epsilon numerator by T_w squared.
This is my thought for the norm calc. I am not sure if I have it right, if I need to do something with torch.diag or not.
norm = torch.norm(
torch.stack([
(torch.abs(p) * p.grad).norm(p=2).to(shared_device)
for group in self.param_groups for p in group["params"]
if p.grad is not None
]),
p=2
)
And this is my thought for epsilon, but I again am not sure if I need to do something with torch.diag, or need to flatten and concat things in a certain way or not.
e_w = torch.pow(p, 2) * p.grad * scale.to(p)
Any thoughts? I'm sorry if you are busy, but figured I'd ask! I've only tested on mnist so far, and did get a slightly improved result vs. SAM, so I think this is at least on the right track if not correct.
Hi, may i know how is the setting of m-sharpness ?
Is it to set the distributed training, where each GPU receives m data?
Thanks in advance.
What is the proper way to perform gradient clipping with the SAM/ASAM optimizer? Should it be performed before both of the optimizer steps or just one of them?
Code example:
# first forward-backward pass
loss = loss_function(output, model(input)) # use this loss for any training statistics
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), v)
optimizer.first_step(zero_grad=True)
# second forward-backward pass
loss_function(output, model(input)).backward() # make sure to do a full forward pass
torch.nn.utils.clip_grad_norm_(model.parameters(), v)
optimizer.second_step(zero_grad=True)
Hi,
Thank you for your implement.
I have a question about rho. Should the value of rho be adjusted according to lr? If the rho is much larger than lr, perhaps it will cause the model to fail to converge. And if I use the lr scheduler to adjust lr, should rho to be adjusted?
Hi!
Thanks for your great work! When I follow your instructions and insert SAM in my code, it works well. But the training loss collapses every time I resume the model. So I suspect that I missed some critical steps to resume SAM. Do you have some suggestions?
thank you!
from utility.bypass_bn import import enable_running_stats, disable_running_stats
Hi. What a great work. Thanks for sharing your work.
I try to add your SAM optimizer in addition to the mixed precision training which is provided on Pytorch > 1.7.0.
There's no guideline about this. Here what I've been trying.
...
scaler.cale(loss).backward()
...
scaler.step(optimizer.first_step)
scaler.update()
optimizer.zero_grad()
loss = loss_fn(image_preds, image_labels)
scaler.scale(loss).backward()
scaler.step(optimizer.second_step)
scaler.update()
optimizer.zero_grad()
Please let me know how to incorporate this into the mixed precision training process.
Moreover, It'd be better to clarify what versions of Pytorch are available.
Hi,
Thanks for sharing this good repo. I'm wondering that whether it is suitable to combine SAM with GAN training, like WGAN-GP loss? Normally WGAN-GP needs to compute gradient penalty, I'm not sure if this conflicts with SAM. Do you have any idea? Thanks!
We're interested in integrating SAM into HomebrewNLP, and straight-up copying the code without copyright or notice doesn't sit right with me.
How would you prefer us to cite you?
I propose adding a header to sam.py as it's done in Shampoo.
Hello, the error rate of cifar100 in your paper is 3.92%. Is this the training error rate, or is it the error rate obtained by applying the trained model to the test set? Looking forward to your reply.
Hi,
is this the correct way of performing lr scheduling?
base_optimizer = torch.optim.Adam
optimizer = SAM(net.parameters(), base_optimizer, lr=3e-2, weight_decay=1e-8)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer.base_optimizer, T_max=200)
Pytorch throws a warning if I perform lr scheduling directly on SAM.
I am trying to reproduce the results for WRN-28-10 (SAM) trained on 10-class classification SVHN dataset (Percentage Error 0.99) - https://paperswithcode.com/sota/image-classification-on-svhn
I'm able to train WRN-28-10 using https://github.com/hysts/pytorch_wrn (Modified the script to incorporate SAM into it)
I'm achieving a test accuracy of 93%. How can I replicate the SOTA Percentage Error 0.99 for WRN-28-10 (SAM). Which hyperparameters do I use?
Any help is appreciated!!
Hi.
This is Jungmin Kwon, one of the authors of Adaptive Sharpness Aware Minimization (ASAM).
We really appreciate your great implementation!
I have performed cifar10 tests with your code and we found that ASAM with rho=2.0 shows the best accuracy among [0.5, 1.0, 2.0, 5.0].
The test error rates obtained from the grid search are as follows:
rho | Test error rate |
---|---|
0.5 | 2.75 % |
1.0 | 2.69 % |
2.0 | 2.55 % |
5.0 | 2.90 % |
In our implementation without bias (or beta for BatchNorm) normalization (https://github.com/SamsungLabs/ASAM), ASAM with rho=0.5 shows the best accuracy (2.37 % for WRN16-8), so we performed all the cifar10 tests with rho=0.5.
If you don't mind, could you update the table of test error rate with the result of rho=2.0 (2.55 %)?
Thank you.
hi!
thanks for your awesome work!
After comparing your code with google's official implementation,
I think there may be something wrong with your code about Batch Normalization's running mean and running var.
As there are 2 forward passes when using SAM, so the running mean and running var will be computed twice on the same data input in your code. But in google's implementation, they save the state(Parameters that will not be calculated gradient, including the running mean and running var) after the first forward pass as inner_state , and then return it as the final state in this step. that's to say they only calculated the running mean and running var one time.
here is google's code :
def get_sam_gradient(model: flax.nn.Model, rho: float):
"""Returns the gradient of the SAM loss loss, updated state and logits.
See https://arxiv.org/abs/2010.01412 for more details.
Args:
model: The model that we are training.
rho: Size of the perturbation.
"""
# compute gradient on the whole batch
(_, (inner_state, _)), grad = jax.value_and_grad(
lambda m: forward_and_loss(m, true_gradient=True), has_aux=True)(model)
if FLAGS.sync_perturbations:
if FLAGS.inner_group_size is None:
grad = jax.lax.pmean(grad, 'batch')
else:
grad = jax.lax.pmean(
grad, 'batch',
axis_index_groups=local_replica_groups(FLAGS.inner_group_size))
grad = dual_vector(grad)
noised_model = jax.tree_multimap(lambda a, b: a + rho * b,
model, grad)
(_, (_, logits)), grad = jax.value_and_grad(
forward_and_loss, has_aux=True)(noised_model)
return (inner_state, logits), grad
so, the little difference may affect the performance when evaluation, right?
Thanks for a great job. It seems the first step does not consider the effect of weight decay, which is quite different from the TensorFlow version. Does this have a negative effect on the performance?
Hello,
Thank you for this great implementation. I have a usage question having to do with multi-GPU training.
There is this line from the paper on page 5:
"To compute the SAM update when parallelizing across multiple accelerators, we divide each data batch
evenly among the accelerators, independently compute the SAM gradient on each accelerator, and
average the resulting sub-batch SAM gradients to obtain the final SAM update."
Do you think doing the first SAM step in your implementation on each accelerator, then averaging the gradients from all the accelerators before the second step would accomplish this?
The code would be:
for examples, labels in loader:
loss = loss_fn(network(examples), labels)
loss.backward()
optimizer.first_step(zero_grad=True)
loss_fn(network(examples), targets).backward()
reduce_gradients_from_all_accelerators()
optimizer.second_step(zero_grad=True)
I believe not reducing the gradients between the accelerators before the first step would accomplish what they describe in the paper, but I'm looking for a second opinion. Any help is much appreciated.
I am running your implementation in Colab, and everything is working well (though taking a lot of time even with a GPU runtime). However, when I try using sam in my own setup on the same cifar-10 I get the following error
`Training model...
AttributeError Traceback (most recent call last)
in
in train(train_loader, model, optimizer, epoch)
32 loss = smooth_crossentropy(output, target)
33 loss.mean().backward()
---> 34 optimizer.first_step(zero_grad=True)
35
36 smooth_crossentropy(model(input_), target).mean().backward()
~/anaconda3/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_no_grad(*args, **kwargs)
47 def decorate_no_grad(*args, **kwargs):
48 with self:
---> 49 return func(*args, **kwargs)
50 return decorate_no_grad
51
~/Downloads/mila project/udacity_DP_FL/sam/sam.py in first_step(self, zero_grad)
14 @torch.no_grad()
15 def first_step(self, zero_grad=False):
---> 16 grad_norm = self._grad_norm()
17 for group in self.param_groups:
18 scale = group["rho"] / (grad_norm + 1e-12)
~/Downloads/mila project/udacity_DP_FL/sam/sam.py in _grad_norm(self)
44 torch.stack([
45 p.grad.norm(p=2)
---> 46 for group in self.param_groups for p in group["params"]
47 ]),
48 p=2
~/Downloads/mila project/udacity_DP_FL/sam/sam.py in (.0)
44 torch.stack([
45 p.grad.norm(p=2)
---> 46 for group in self.param_groups for p in group["params"]
47 ]),
48 p=2
AttributeError: 'NoneType' object has no attribute 'norm'`
Hey @davda54 ,
This looks awesome. I wondered if you would be interested in making a PR contribution with SAM in Pytorch Lightning.
I am sure the community would love it.
Best,
T.C
Hi, I have implemented SAM to be used along with PyTorch Lightning, but it crashes at the beginning of the training:
This is a resume of the key lines of my LightningModule that involve SAM:
class pl_module(LightningModule):
def __init__(self, params...):
super().__init__()
self.save_hyperparameters()
self.automatic_optimization = False
def configure_optimizers(self):
optim = SAM(self.parameters(), SGD, lr=0.01)
return optim
def training_step(self, batch, batch_idx):
optimizer = self.optimizers()
def closure():
loss = self.compute_loss(batch)
self.manual_backward(loss, optimizer)
return loss
loss = self.compute_loss(batch)
self.manual_backward(loss, optimizer)
optimizer.optimizer.step(closure=closure)
optimizer.optimizer.zero_grad()
return loss
This is the error:
Traceback (most recent call last):
File "trainer_v2.py", line 409, in <module>
train(folder_name, path_results, data_params, backbone_params, clf_params, contrastive_params,
File "trainer_v2.py", line 311, in train
trainer.fit(model, dm)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
self._run(model)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
self.dispatch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
self.accelerator.start_training(self)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
return self.run_train()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
self.train_loop.run_training_epoch()
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 714, in run_training_batch
self.training_step_and_backward(
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 823, in training_step_and_backward
result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 290, in training_step
training_step_output = self.trainer.accelerator.training_step(args)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 204, in training_step
return self.training_type_plugin.training_step(*args)
File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 155, in training_step
return self.lightning_module.training_step(*args, **kwargs)
File "trainer_v2.py", line 187, in training_step
optimizer.second_step(zero_grad=True)
File "/opt/conda/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/asabater/2021_projects/evnet/sam.py", line 34, in second_step
p.sub_(self.state[p]["e_w"]) # get back to "w" from "w + e(w)"
KeyError: 'e_w'
how to save this problem?
def build_optimizer(config, model):
"""
Build optimizer, set weight decay of normalization to 0 by default.
"""
skip = {}
skip_keywords = {}
if hasattr(model, 'no_weight_decay'):
skip = model.no_weight_decay()
if hasattr(model, 'no_weight_decay_keywords'):
skip_keywords = model.no_weight_decay_keywords()
parameters = set_weight_decay(model, skip, skip_keywords)
opt_lower = config.TRAIN.OPTIMIZER.NAME.lower()
optimizer = None
if opt_lower == 'sgd':
optimizer = optim.SGD(parameters, momentum=config.TRAIN.OPTIMIZER.MOMENTUM, nesterov=True,
lr=config.TRAIN.BASE_LR, weight_decay=config.TRAIN.WEIGHT_DECAY)
elif opt_lower == 'adamw':
optimizer = optim.AdamW(parameters, eps=config.TRAIN.OPTIMIZER.EPS, betas=config.TRAIN.OPTIMIZER.BETAS,
lr=config.TRAIN.BASE_LR, weight_decay=config.TRAIN.WEIGHT_DECAY)
optimizer_sam = SAM(parameters, optimizer, momentum=0.9)
return optimizer_sam
Line 34 in 4acec41
You seem to be getting statistics from both training and test data in order to normalize the inputs. This does not seem like the right way. This should loop only over the training data.
During the training process, is the first loss larger than the second loss? But my situation is the opposite.
thank you very much for your share!
ISSUE SUMMARY: Command-line argument label_smoothing
is silently ignored and hard coded values are being used.
Command-line argument label_smoothing
is set to a default value of 0.1
Line 23 in cdcbdc1
The default argument argument is set at 0.1
in smooth_cross_entropy(smoothing=0.1)
As a result, any argument to label_smoothing
is silently ignored. To fix this, please pass args.label_smoothing
in to the function smooth_cross_entropy()
when it is used in train.py
Lines 53 to 75 in cdcbdc1
Please see: https://github.com/davda54/sam/search?q=smoothing
In your implementation of SAM, you split the optimization into first and second step and the code gives an error if "step" is called because it is not a normal optimizer. This doesn't necessarily have to be true because of the optional argument to step called "closure" which is a feature of optimizers like LBFGS that evaluate the loss and gradients at intermediate states (https://pytorch.org/docs/stable/optim.html?highlight=bfgs#torch.optim.LBFGS). If we use closure, the training routine for SAM looks like
model = YourModel()
base_optimizer = torch.optim.SGD # define an optimizer for the "sharpness-aware" update
optimizer = SAM(model.parameters(), base_optimizer, lr=0.1, momentum=0.9)
for input, output in data:
def closure():
optimizer.zero_grad()
loss = loss_function(output, model(input))
loss.backward()
return loss
loss = loss_function(output, model(input))
loss.backward()
optimizer.step(closure)
optimizer.zero_grad()
where step(closure) is defined as,
def step(self, closure=None):
assert closure is not None, "Sharpness Aware Minimization requires closure but it was not provided"
self.first_step()
closure()
self.second_step()
This change would bring it more in compliance with the standard pytorch optimizer implementation and make it easier to adopt since some fraction of people already implement closure based optimizers. At the very least, it gives people the option of which to adopt.
Hi @davda54. Thank you for open-sourcing this implementation.
Wanted to know the reason behind adding 1e-12
in https://github.com/davda54/sam/blob/main/sam.py#L18?
Thanks for the work!
I'd like to know if the the original code is also applicable to accelerated training, i.e. using automatic mixed precision like fp16. I tried to adopt SAM in my own training codes with apex fp16, but Nan loss happens and the computed grad norm is Nan. When I switch to fp32, it goes on well. Is it incompatible with fp16? What are the suggestions to make the code work with fp16? Thanks!
i encounter this problem,please help me ,thanks a lot
the code is :
output = model(images)
# loss = criterion(output, target)
# SAM
# first forward-backward pass
loss = criterion(output, target) # use this loss for any training statistics
loss.backward()
optimizer.first_step(zero_grad=True)
# second forward-backward pass
criterion(output, target).backward() # make sure to do a full forward pass
optimizer.second_step(zero_grad=True)
# measure accuracy and record loss
acc1, acc5 = accuracy(output, target, topk=(1, 5))
losses.update(loss.item(), images.size(0))
top1.update(acc1[0], images.size(0))
top5.update(acc5[0], images.size(0))
the problen is :
Traceback (most recent call last):
File "C:/softWare/SAM/main.py", line 557, in
main()
File "C:/softWare/SAM/main.py", line 156, in main
main_worker(args.gpu, ngpus_per_node, args)
File "C:/softWare/SAM/main.py", line 340, in main_worker
acc2, loss2=train(train_loader, model, criterion, optimizer, epoch, args, f)
File "C:/softWare/SAM/main.py", line 413, in train
criterion(output, target).backward() # make sure to do a full forward pass
File "C:\softWare\Python37\lib\site-packages\torch\tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "C:\softWare\Python37\lib\site-packages\torch\autograd_init_.py", line 132, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate results have already been freed. Specify retain_graph=True when calling backward the first time.
please let me know where is wrony
Hi, thank you very much for providing this great repo!!
I'd like to use your SAM implementation in some code where we use torch
's native AMP functionality.
More specifically, we use the torch.cuda.amp.GradScaler class, which invokes the optimizer via scaler.step(optimizer)
. This is slightly tricky because your implemented optimizer differs from the expected interface since it has two optimization steps.
I'm sure there is an elegant way to still use your implementation in such settings. If you have any advice on this, I'd highly appreciate it!
i use SAM train with pytorch lighting.when i use muti-GPU,there is some error.
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256]], which is output 73 of BroadcastBackward, is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Hi @davda54
Thank you for your great Pytorch implementation!
I want to know whether the momentum of SGD can be safely removed when using the SAM optimizer? That is, is momentum necessary for the SAM optimizer? Because momentum has a negative impact on my task.
base_optimizer = torch.optim.SGD
optimizer = SAM(base_optimizer, rho=0.05, lr=0.1, weight_decay=0.0005,momentum=0)
Thank you again and look forward to replying~
On my local machine, the line
from sam import SAM
fails with the error:
ImportError: cannot import name 'Error' from partially initialized module 'sam' (most likely due to a circular import) (d:\...\python38\lib\site-packages\sam\__init__.py)
I'm on Windows 10 using python 3.8.6 and pytorch 1.7.0+cu110. I installed sam via pip. Do I totally overlook something here?
Thanks for implementing this!
what are the argument values passed to train.py in order to train Wide Resnet without SAM or ASAM?
Thanks
Hi,
I am pretty sure that there is an error in the implementation of _grad_norm method.
If we look at the equation 2 in the original paper the denominator contains q-norm of vector grad(w) to the power of q and all that to the power 1/p. In the case of q=2 and p=2 we get (L2(grad(w)) ^ 2) ^ 0.5.
Your implementation stacks L2 norms of each parameter tensor, stacks them and calculate L2 norm of those stacked norm which is not the same as the above formula. You should ravel and concatenate all the parameter tensors and then calculate L2 norm two times.
Please check my reasoning but I am almost sure, that I am right.
Best Regards,
Jacek
Thank you for the great implementation!
Out of curiosity, why have you added your own step_lr
scheduler and are not using the cosine annealing scheduler as suggested in the original paper?
my code
scaler = amp.GradScaler()
...
epoch_losses.update(loss.item(), len(inputs))
scaler.scale(loss).backward()
scaler.update()
How should I apply the code you provided when I do it this way?
Is there an example to use this library with HuggingFace models?
Hi there! I was watching your code and i have a doubt regarding the _grad_nom(self)
method: from the SAM paper the grad norm used to compute epsilon_hat(w) defined by the following equation
supposing p=2 and then q=2 (1/p + 1/q = 1), would simply be the 2-norm of the gradients. So, shouldn't be the _grad_norm(self)
method something like this:
@torch.no_grad()
def _grad_norm(self):
# put everything on the same device, in case of model parallelism
shared_device = self.param_groups[0]["params"][0].device
norm = torch.sqrt(
sum(
[
torch.sum(
torch.square(
p.grad
* (torch.abs(p.grad) if group["adaptive"] else 1.0)
)
).to(shared_device)
for group in self.param_groups
for p in group["params"]
if p.grad is not None
]
),
)
return norm
I've already tested it with SAM obtaining an accuracy of 97.17% and I'm now running tests with ASAM
Hi, is there anyone who applied SAM on the COCO object detection dataset? How do we set the hyper-parameter rho? I have tested rho=0.05 but got a performance degradation compared with SGD.
Hi. Following the suggestion posted in README, I disabled BatchNorm2d as below and then my model would not converge anymore? Did you literally mean BatchNorm2d or SyncBN? Hopefully you can find out what is wrong with this modification. Many thanks!
for epoch in range(args.epochs):
model.train()
log.train(len_dataset=len(dataset.train))
for batch in dataset.train:
inputs, targets = (b.to(device) for b in batch)
# first forward-backward step
enable_bn(model) # update BN in the wk
predictions = model(inputs)
loss = smooth_crossentropy(predictions, targets)
loss.mean().backward()
optimizer.first_step(zero_grad=True)
# second forward-backward step
disable_bn(model) # do not update BN as we are in the perturbation point w_adv
smooth_crossentropy(model(inputs), targets).mean().backward()
optimizer.second_step(zero_grad=True)
train_total_number += targets.shape[0]
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.