RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 512, 16, 16]], which is output 0 of ConstantPadNdBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True). (funit issue, open, 32 comments)

nvlabs commented on July 26, 2024
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 512, 16, 16]], which is output 0 of ConstantPadNdBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
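For reference, the hint in the error refers to PyTorch's anomaly-detection mode. A minimal sketch of enabling it (the toy graph below just stands in for the real model):

import torch

# Globally enable anomaly detection (slow; intended only for debugging):
torch.autograd.set_detect_anomaly(True)

# ...or scope it to a single forward/backward pass:
with torch.autograd.detect_anomaly():
    x = torch.randn(4, requires_grad=True)
    y = x.exp()
    y.sum().backward()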


Comments (32)

daeyoun24 commented on July 26, 2024

This error can be resolved by setting inplace=False in nn.ReLU and nn.LeakyReLU in blocks.py.
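For reference, a minimal sketch of the kind of change meant here (the exact module layout in blocks.py may differ):

import torch.nn as nn

# In-place activations overwrite their input buffer, which autograd may still
# need for the backward pass:
# act = nn.LeakyReLU(0.2, inplace=True)

# The out-of-place version leaves the original tensor untouched:
act = nn.LeakyReLU(0.2, inplace=False)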


ma-xu commented on July 26, 2024

This error can be resolved by setting inplace=False in nn.ReLU and nn.LeakyReLU in blocks.py.

However, directly setting inplace=False will certainly decrease performance. We can check allocated memory using torch.cuda.memory_allocated(); in some cases, it could be almost double.

Another solution is to use clone(). If someone wants to operate on a tensor slice like x[0,1,:,:], a good choice is x[0,1,:,:].clone().

Any clearer description? Where should the clone operation be done for this issue?

Could you post your code here? It depends on your implementation.

I solved my problem like:

out = self.conv3(out)
out = self.norm3(out)
out = self.rgc({0: out, 1: x[1]})
if self.downsample is not None:
    identity = self.downsample(x[0])
out_x = out[0].clone() + identity
out_x = self.relu(out_x)
out_att = out[1]


ma-xu commented on July 26, 2024

This error can be resolved by setting inplace=False in nn.ReLU and nn.LeakyReLU in blocks.py.

However, directly setting inplace=False will certainly decrease performance. We can check allocated memory using torch.cuda.memory_allocated(); in some cases, it could be almost double.

Another solution is to use clone(). If someone wants to operate on a tensor slice like x[0,1,:,:], a good choice is x[0,1,:,:].clone().
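A tiny sketch of the clone() idea, using exp() (whose backward needs its own output) to stand in for the real network:

import torch

x = torch.randn(4, 4, requires_grad=True)
y = torch.exp(x)           # backward of exp() needs y itself

# y[0, 1] = 0              # in-place edit -> the RuntimeError above
z = y[0, 1].clone() * 0    # work on a copy instead; y stays at version 0

(y.sum() + z).backward()   # succeeds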


alik604 commented on July 26, 2024

I just did a target_value = target_value.detach(). Error is gone
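For context, a small sketch of what detach() does here (it cuts the tensor out of the graph, so no gradients flow through it; whether that is what you want is discussed further down the thread):

import torch
import torch.nn.functional as F

pred = torch.randn(8, 1, requires_grad=True)
target_value = pred * 0.9 + 1.0       # originally part of the graph

# detach() returns a tensor that shares data but is excluded from autograd:
target_value = target_value.detach()

loss = F.mse_loss(pred, target_value)
loss.backward()                       # gradients flow into pred only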


LiUzHiAn commented on July 26, 2024

Hi, I met a similar error when I tried to debug my code using DistributedDataParallel, something like the following:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: 
[torch.cuda.FloatTensor [4, 4]] is at version 4; expected version 3 instead. 
Hint: the backtrace further above shows the operation that failed to compute its gradient. 
The variable in question was changed in there or anywhere later. Good luck!

The [torch.cuda.FloatTensor [4, 4]] mentioned in the error is actually a registered buffer. After setting the parameter broadcast_buffers=False of torch.nn.parallel.DistributedDataParallel(..., broadcast_buffers=False, ...), the problem was solved.

If you are in a similar scenario, try setting torch.nn.parallel.DistributedDataParallel(..., broadcast_buffers=False, ...). But to be honest, I still don't know what is going wrong; I would appreciate any ideas.

BTW, I am using exactly one GPU and one node.
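A minimal sketch of the workaround above (assuming the process group has already been initialized, e.g. via torchrun; the model here is just a placeholder with a BatchNorm buffer):

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed.init_process_group(...) has already been called
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU()).cuda()

# broadcast_buffers=False stops DDP from re-broadcasting registered buffers
# (e.g. BatchNorm running stats) before every forward pass, which is the
# in-place buffer update the error message pointed at in this case.
ddp_model = DDP(model, broadcast_buffers=False)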


kovacgrgur commented on July 26, 2024

For me, the error appeared in PyTorch version "1.10.2+cu2". It was caused by calling F.relu or nn.ReLU() (inplace=False didn't fix it).

I fixed it by manually implementing relu:

def my_relu(x):
    return torch.maximum(x, torch.zeros_like(x))

Curiously, using F.relu6 didn't cause the issue. I suppose that there is some issue with the relu implementation.
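A sketch of dropping the manual version in where the module form was used (the layer sizes here are just placeholders):

import torch
import torch.nn as nn

def my_relu(x):
    return torch.maximum(x, torch.zeros_like(x))

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)

    def forward(self, x):
        return my_relu(self.conv(x))   # instead of F.relu / nn.ReLU()

y = Block()(torch.randn(1, 3, 16, 16))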


ma-xu commented on July 26, 2024

This error can be resolved by setting inplace=False in nn.ReLU and nn.LeakyReLU in blocks.py.

However, directly setting inplace=False will certainly decrease performance. We can check allocated memory using torch.cuda.memory_allocated(); in some cases, it could be almost double.


GiangHLe commented on July 26, 2024

I just did a target_value = target_value.detach(). Error is gone

As far as I know, doing this overwrites target_value with a version that does not require grad, i.e. it removes the tensor target_value from the computation graph. How can you guarantee the backpropagation result?


namnguyenhai commented on July 26, 2024

For me, the error appeared in PyTorch version "1.10.2+cu2". It was caused by calling F.relu or nn.ReLU() (inplace=False didn't fix it).

I fixed it by manually implementing relu:

def my_relu(x):
    return torch.maximum(x, torch.zeros_like(x))

Curiously, using F.relu6 didn't cause the issue. I suppose that there is some issue with the relu implementation.

That's right. I also got the error with ReLU despite using inplace=False; when I don't use ReLU, there is no error.


hbchen121 commented on July 26, 2024

This problem occurred when I used torch==1.10.0 and disappeared when I used torch==1.7.1.


MaheepChaudhary commented on July 26, 2024

I am facing this error

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: 
[torch.FloatTensor [5, 6]], which is output 0 of TBackward, is at version 2; expected version 1 instead. 
Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

and I can't figure out the solution after reading many articles. The code in which this error occurs is:

model.py

from config import *

class Lambda(nn.Module):

    def __init__(self,operation):
        self.operation = operation
        super().__init__()
    
    def forward(self,x):
        return self.operation(x)    

class memory(nn.Module):

    def __init__(self,
                 gamma=0.95, 
                 entry=20, 
                 entry_element=5,  
                 n=10,
                 classes = 10, 
                 input_shape = (28,28,1)):
        
        super().__init__()
        global memory_size
        global batch_size
        self.gamma = torch.tensor([gamma],  requires_grad = False)
        self.entry = entry
        self.entry_element = entry_element
        #alpha = tf.Variable(np.random.randint(1), trainable = True, dtype = tf.float32)
        #gate_param = tf.sigmoid(alpha)
        self.n = n
        self.no_of_parameters = entry_element + 1
        self.no_of_classes = classes
        self.input_shape = input_shape
        self.tanh = nn.Tanh()
        self.softmax = nn.Softmax(dim = 1)
        self.classification_softmax = nn.Softmax(dim = 1)
        self.sigmoid = nn.Sigmoid()
        self.einsum = Lambda(lambda a: torch.einsum('abc,abd->adc',a[0], a[1])) #abc(128,20,5), abd(128,20,1)
        self.reduce_sum = Lambda(lambda x : torch.sum(x, axis = 1, keepdims=True))
        self.keras_multiply = Lambda(lambda xy : torch.einsum('abc,def->abf',xy[0], xy[1]))
        # mask = Lambda(lambda x : torch.slice(tf.sort(x, axis=1, direction='ASCENDING', name=None), begin=[0, n, 0], size=[-1, 1, 1]))
        self.mask = Lambda(lambda xn : torch.sort(xn[0],1)[0][:,-xn[1],:].clone())
        self.greater = Lambda(lambda jk : torch.greater(jk[1],jk[0].tile((1,memory_size[0],1))))
        self.controller = nn.LSTM(self.input_shape[-2]*self.input_shape[0]* batch_size,
                                  self.entry_element,
                                  1) #for inserting just 12 LSTM Layer
        self.key_dense = nn.Linear(self.entry_element,self.no_of_parameters)
        self.classification_layer = nn.Linear(self.entry_element,self.no_of_classes)
        

    def forward(self, inputs, state):
        
        i = torch.squeeze(inputs)
        f = torch.flatten(i,start_dim = 1)
        print(f"The shape of f is {f.shape}")
        inp = torch.reshape(f, (1, 1,-1))
        # print(f"The shape of the inputs is {inputs.to(dtype= torch.float32).dtype}")
        out, (h_n, c_n) = self.controller(inp.float())
        # print(f"The types of LSTM outputs are {type(out)} and {type(h_n)} and {type(c_n)}")
        out_1 = torch.tanh_(out)
        print(f"the shape of out is {out_1.shape}")
        out_2 = self.key_dense(out_1)
        p = out_2[:,:,:self.entry_element]
        key = p
        gate_param = torch.squeeze(torch.sigmoid_(out_2[:,:,-1].clone()))
        # gate_param.squeeze_()

        #writing
        # print(gate_param)
        w_w = torch.add(torch.multiply(gate_param.clone(), state['w_r'].clone()), 
                        torch.multiply((1-gate_param), state['w_lu'].clone()))
        
        print(f"ther shape of w_w is {w_w.shape}")
        print(f"The shape of key is {key.shape}")
        write = self.keras_multiply([w_w, key])
        print(state['M'].clone().shape)
        print((write).shape)
        M = torch.add(state['M'].clone(), write)

        #reading
        
        #M_dot_kt =  dot([tile(M, kt])
        print(f"The shape of M is {M.shape} and key is {key.shape}")
        M_dot_kt = torch.matmul(M, torch.squeeze(key)) #(128,20)
        '''The matmul function do the dotproduct of 3D tesnors'''
        M_dot_kt = torch.unsqueeze(M_dot_kt, dim = -1)
        print(f"The shape and type of the M_dot_kt is {M_dot_kt.shape} and {M_dot_kt.dtype} ")
        w_r = self.softmax(M_dot_kt)
        #w_r = M_dot_kt
        
        r_t = self.einsum([M, w_r])
        print(f"The shape of r_t recieved is {r_t.shape}")

        #least used related computation
        # print(f"The shape of gamma is {self.gamma.shape} and w_u is {state['w_u'].shape}")
        print(self.gamma)
        gamma_w_u = torch.multiply(state['w_u'].clone(),self.gamma) #(128,20,1)
        
        w_u = torch.add(torch.add(gamma_w_u, w_r), w_w)
        masked = self.mask([w_u,self.n])
        tile_masked = torch.tile(masked, (1,self.entry))
        tile_masked.unsqueeze_(-1)
        print(f"MAsked shape is {tile_masked.shape}")
        w_lu = torch.greater(w_u, tile_masked)

        states = [r_t, w_r, w_lu, w_u, M]
        # state_w_r = w_r
        # state_w_lu = w_lu
        # state_w_u = w_u
        # state_m = M    
        '''
        next_states = {
            'read_vector': states[0], 
            'w_r': states[1],
            'w_lu': states[2],
            'w_u': states[3],
            'M': states[4],
        }
        '''
        flattened = torch.flatten(r_t, start_dim = 1)
        print(f"The shape of the flattened varaible is {flattened.shape}")
        flattened_output = self.classification_layer(flattened)
        print("The shape of the output is ",flattened_output.shape)
        # output = torch.reshape(flattened_output, (batch_size,self.no_of_classes))
        pred_class = self.classification_softmax(flattened_output)
        output = pred_class

        return {'read_vector': states[0],
                'w_r': states[1],
                'w_lu': states[2], 
                'w_u': states[3], 
                'M': states[4]}, output


    def zero_state(self,batch_size):
        one_hot_weight_vector = torch.tensor(torch.rand([batch_size, self.entry, 1]), requires_grad = False)
        one_hot_weight_vector[..., 0] = 1
        one_hot_weight_vector = torch.tensor(one_hot_weight_vector, requires_grad = False)

        state = {
            'read_vector': torch.tensor(torch.rand([batch_size, 1, self.entry_element]), requires_grad = False),
            'w_r': one_hot_weight_vector,
            'w_lu': one_hot_weight_vector,
            'w_u': one_hot_weight_vector,
            'M': torch.tensor(torch.ones([batch_size,
                                          self.entry, 
                                          self.entry_element], dtype = torch.float32) * 1e-6, requires_grad = False)
        }
        return state

    

main.py

from config import *
from model import memory
from preprocessing import data_batched, data_batched_test


def train(model, device, train_loader, optimizer, epoch):
    global batch_size
    # model.train()
    state = model.zero_state(batch_size)
    for batch_idx, (data, target) in enumerate(train_loader):
        print(f"The batch_idx value is {batch_idx}")
        data, target = data.to(device), target.to(device)
        
        next_state,output = model(data, state)
        loss = nn.CrossEntropyLoss()(output, target).clone()
        # torch.autograd.set_detect_anomaly(True)
        loss.backward(retain_graph = True)
        optimizer.step() 
        optimizer.zero_grad()
        state = next_state
        if batch_size % 32 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

def test(args, model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item() # sum up batch loss
            pred = output.argmax(dim=1, keepdim=True) # get the index of the max log-probability
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main(save_model, epochs):
    # Training settings

    global data_batched_test
    global data_batched
    global memory
    global batch_size
    model = memory()
    # state = model.zero_state(batch_size)
    device = torch.device("cpu")

    optimizer = optim.Adam(model.parameters(), lr = 0.000001)
    print(model.parameters())
    for epoch in range(1, epochs + 1):
        train(model, device, data_batched, optimizer, epoch)
        # test(model, device, data_batched_test)

    if save_model == True:
        torch.save(model.state_dict(),"mnist_cnn.pt")
       
if __name__ == '__main__':
    main(True, 10)()


stepanveret commented on July 26, 2024

torch.sqrt(tensor * tensor.clone().detach()) instead of torch.abs(tensor) worked for me


yuzheyao22 commented on July 26, 2024

I solved my problem by exchanging the positions of "loss.backward()" and "optimizer.step()".
In this case, I call "optimizer.step()" before "lr_scheduler.step()", which is just the opposite of what is suggested by the warning.
I hope this can help you.
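For reference, a sketch of the ordering PyTorch's documentation recommends (backward first, then optimizer.step(), then lr_scheduler.step()); the model and optimizer here are just placeholders:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10)

for step in range(100):
    x = torch.randn(32, 10)
    loss = model(x).pow(2).mean()

    optimizer.zero_grad()
    loss.backward()      # 1. compute gradients
    optimizer.step()     # 2. update parameters
    scheduler.step()     # 3. only then advance the LR schedule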


chocolocked commented on July 26, 2024

This error can be resolved by setting inplace=False in nn.ReLU and nn.LeakyReLU in blocks.py.

However, directly setting inplace=False will certainly decrease performance. We can check allocated memory using torch.cuda.memory_allocated(); in some cases, it could be almost double.

Another solution is to use clone(). If someone wants to operate on a tensor slice like x[0,1,:,:], a good choice is x[0,1,:,:].clone().

Any clearer description? Where should the clone operation be done for this issue?

Could you post your code here? It depends on your implementation.

I solved my problem like:

out = self.conv3(out)
out = self.norm3(out)
out = self.rgc({0: out, 1: x[1]})
if self.downsample is not None:
    identity = self.downsample(x[0])
out_x = out[0].clone() + identity
out_x = self.relu(out_x)
out_att = out[1]

Thank you so much. Solved my problem magically!


rogyizac commented on July 26, 2024

Fixed by adding torch.no_grad() on top of model.eval().

I was getting the same error; I was performing inference twice in a single loop on a model set to eval mode.

My train loop looked something like this:

model1.train()
model2.eval()
for (i, batch) in enumerate(train_img_caption_dataloader):
    ...
    recon_x1, mu1, logvar1, mu21, logvar21 = model2(images)
    recon_x2, mu2, logvar2, mu22, logvar22 = model2(recon_x1)

    inputs1 = {'image_emb':mu1, 'text_emb':embeddings.squeeze(1)}
    inputs2 = {'image_emb':mu2, 'text_emb':embeddings.squeeze(1)}

    outputs1 = model1(**inputs1)
    outputs2 = model1(**inputs2)
    # Calculate loss
    loss_value = bceloss(torch.squeeze(outputs1, dim=1), labels.float())
    loss_value += bceloss(torch.squeeze(outputs2, dim=1), 1.0 - labels)
    
    loss_value.backward()
    optimizer.step()

Adding torch.no_grad() for the model2 inferences solved the problem:

model1.train()
model2.eval()
for (i, batch) in enumerate(train_img_caption_dataloader):
    ...

    with torch.no_grad():
        recon_x1, mu1, logvar1, mu21, logvar21 = model2(images)
        recon_x2, mu2, logvar2, mu22, logvar22 = model2(recon_x1)

    inputs1 = {'image_emb':mu1, 'text_emb':embeddings.squeeze(1)}
    inputs2 = {'image_emb':mu2, 'text_emb':embeddings.squeeze(1)}

    outputs1 = model1(**inputs1)
    outputs2 = model1(**inputs2)
    # Calculate loss
    loss_value = bceloss(torch.squeeze(outputs1, dim=1), labels.float())
    loss_value += bceloss(torch.squeeze(outputs2, dim=1), 1.0 - labels)
    
    loss_value.backward()
    optimizer.step()


uyuutosa commented on July 26, 2024

Hi,
I ran into the same problem, so I added torch.autograd.set_detect_anomaly(True) and retried; then the error below occurred.

/pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:57: UserWarning: Traceback of forward call that caused the error:
  File "train.py", line 85, in <module>
    opts.multigpus)
  File "/home/yu/proj/FUNIT/trainer.py", line 48, in gen_update
    al, ad, xr, cr, sr, ac = self.model(co_data, cl_data, hp, 'gen_update')
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yu/proj/FUNIT/funit_model.py", line 40, in forward
    _, xa_gan_feat = self.dis(xa, la)
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yu/proj/FUNIT/networks.py", line 65, in forward
    feat = self.cnn_f(x)
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yu/proj/FUNIT/blocks.py", line 66, in forward
    x_s = self.conv_s(x) if self.learned_shortcut else x
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yu/proj/FUNIT/blocks.py", line 165, in forward
    x = self.conv(self.pad(x))
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 343, in forward
    return self.conv2d_forward(input, self.weight)
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 340, in conv2d_forward
    self.padding, self.dilation, self.groups)

Elapsed time in update: 1.444831
Traceback (most recent call last):
  File "train.py", line 85, in <module>
    opts.multigpus)
  File "/home/yu/proj/FUNIT/trainer.py", line 48, in gen_update
    al, ad, xr, cr, sr, ac = self.model(co_data, cl_data, hp, 'gen_update')
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/yu/proj/FUNIT/funit_model.py", line 50, in forward
    l_total.backward()
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/yu/.pyenv/versions/anaconda3-5.2.0/envs/py36_FUNIT/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [4, 512, 16, 16]], which is output 0 of ConstantPadNdBackward, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

And I found that the code below triggered the error:

_, xb_gan_feat = self.dis(xb, lb)  
_, xa_gan_feat = self.dis(xa, la)   

It seems the first input into self.dis succeeds but the second one fails.
Do you know how to resolve it?


phongnhhn92 commented on July 26, 2024

I am also having the same issue.
I hit this issue when I try to upgrade to the latest version of PyTorch (1.2).

Traceback (most recent call last):
  File "/home/phong/data/Work/Paper2/Code/FUNIT/train.py", line 83, in <module>
    d_acc = trainer.dis_update(co_data, cl_data, config)
  File "/home/phong/data/Work/Paper2/Code/FUNIT/trainer.py", line 62, in dis_update
    al, lfa, lre, reg, acc = self.model(co_data, cl_data, hp, 'dis_update')
  File "/home/phong/miniconda3/envs/deeplearning/lib/python3.7/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/phong/data/Work/Paper2/Code/FUNIT/funit_model.py", line 55, in forward
    l_real.backward(retain_graph=True)
  File "/home/phong/miniconda3/envs/deeplearning/lib/python3.7/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/phong/miniconda3/envs/deeplearning/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 512, 16, 16]], which is output 0 of ConstantPadNdBackward, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).


timemao commented on July 26, 2024

This error can be resolved by setting inplace=False in nn.ReLU and nn.LeakyReLU in blocks.py.

However, directly setting inplace=False will certainly decrease performance. We can check allocated memory using torch.cuda.memory_allocated(); in some cases, it could be almost double.

Another solution is to use clone(). If someone wants to operate on a tensor slice like x[0,1,:,:], a good choice is x[0,1,:,:].clone().

Any clearer description? Where should the clone operation be done for this issue?


1637313 commented on July 26, 2024

Hi, I am having a similar issue with my code. Can you please assist? See below.

class DQN_lstm(nn.Module):
"""
A basic implementation of a Deep Q-Network. The architecture is the same as that described in the
Nature DQN paper.
"""

def __init__(self,
             state_space,
             num_actions,
            lstm_input_size = 4,
            lstm_seq = 30,
            lstm_hidden_size = 128,
            lstm_num_layers = 1):
    """
    Initialise the DQN
    :param observation_space: the state space of the environment
    :param action_space: the action space of the environment
    """
    super(DQN_lstm,self).__init__()
    self.action_space = num_actions
    self.lstm_input_size = lstm_input_size
    self.lstm_seq = lstm_seq
    self.lstm_hidden_size = lstm_hidden_size
    self.lstm_num_layers = lstm_num_layers
    
    self.lstm = nn.LSTM(self.lstm_input_size,
                       self.lstm_hidden_size,
                       self.lstm_num_layers,
                       batch_first=True)#(input,hidden,num_layers)
    self.outputLayer = nn.Linear(self.lstm_hidden_size+3,self.action_space)
    
def forward(self,state,hidden_state,cell_state,dones=None):
    if dones ==None:
        h,(hidden_state,cell_state) = self.lstm(state[0],
                                                (hidden_state,cell_state,))
    else:
        output_list =[]
        for input_state,nd in zip(state[0].unbind(),dones.unbind()):
            #Reset core state to zero wheneveran episode ends
            #Make done broadcastable with (num_layers,batch,hidden_size)
            nd = nd.view(1,-1,1)
            out,(hidden_state,cell_state) = \
            self.lstm(input_state.unsqueeze(0),
                      (nd*hidden_state,nd*cell_state))
            #out -> (batch,seq,hidden_size)
            output_list.append(out)
        h = torch.cat(output_list) # -> (batch,seq,hidden)
    
    a = h[:,-1,:].clone()#(batch,seq,hidden) -> (batch,hidden)

    b = torch.cat((a,state[1]),1)

    x = self.outputLayer(b.float())
    
    return x,hidden_state,cell_state
def init_states(self):
    batch_size =1
    hidden_state = torch.zeros(self.lstm_num_layers,
                              batch_size,
                              self.lstm_hidden_size).to(device)
    cell_state = torch.zeros(self.lstm_num_layers,
                            batch_size,
                            self.lstm_hidden_size).to(device)
    return hidden_state,cell_state

def reset_states(self,hidden_state,cell_state):
    hidden_state[:,:,:] = 0
    cell_state[:,:,:] = 0
    return hidden_state.detach(),cell_state.detach()

class DQNAgent:
def __init__(
self,
state_space,
num_actions,
replay_buffer: ReplayBuffer,
use_double_dqn,
lr,
batch_size,
gamma,
):

    self.action_space = num_actions
    self.state_space = state_space
    self.replay_buffer = replay_buffer
    self.batch_size = batch_size
    self.gamma = gamma
    
    
    self.dqn = DQN_lstm(state_space,num_actions).to(device)
    
    self.dqn_hidden_state,self.dqn_cell_state = self.dqn.init_states()
    self.target_hidden_state,self.target_cell_state = self.dqn.init_states()
    
    
    self.target = DQN_lstm(state_space,num_actions).to(device)

    self.criterion = nn.MSELoss()
    self.optimizer = optim.Adam(self.dqn.parameters(), lr=lr)

    self.update_target_network()

def optimise_td_loss(self):
    """
    Optimise the TD-error over a single minibatch of transitions
    :return: the loss
    """
    # TODO
    #   Optimise the TD-error over a single minibatch of transitions
    #   Sample the minibatch from the replay-memory
    #   using done (as a float) instead of if statement
    #   return loss

    states,stats, actions, rewards, next_states,stats_primes, dones = self.replay_buffer.sample(self.batch_size)

    states = torch.from_numpy(states).float().to(device)
    stats = torch.from_numpy(stats).float().to(device)
    actions = torch.from_numpy(actions).to(device)
    rewards = torch.from_numpy(rewards).to(device)
    next_states = torch.from_numpy(next_states).float().to(device)
    stats_primes = torch.from_numpy(stats_primes).float().to(device)
    dones = torch.from_numpy(dones).float().to(device)

    tuple_state = (states,stats)
    tuple_state_prime = (next_states,stats_primes)

    prediction,self.dqn_hidden_state,self.dqn_cell_state = \
    self.dqn(tuple_state,self.dqn_hidden_state,self.dqn_cell_state,dones)

    current_q_value = \
    prediction.gather(1,actions.unsqueeze(1))

    target_prediction,self.target_hidden_state,self.target_cell_state = \
    self.target(tuple_state_prime,self.target_hidden_state,
               self.target_cell_state,dones)
    max_q,_ = torch.max(target_prediction,1)

    target_q_value = rewards + (dones*self.gamma*max_q.detach())
    target_q_value = target_q_value.unsqueeze(1)
    
    self.optimizer.zero_grad()

    loss = self.criterion(current_q_value.float(),target_q_value.float())
    loss.backward(retain_graph=True)
    self.optimizer.step()

    return loss.item()

def update_target_network(self):
    """
    Update the target Q-network by copying the weights from the current Q-network
    """
    self.target.load_state_dict(self.dqn.state_dict())

def act(self, state: np.ndarray):
    """
    Select an action greedily from the Q-network given the state
    :param state: the current state
    :return: the action to take
    """
    state = (torch.from_numpy(state[0]).unsqueeze(0).float().to(device),
             torch.from_numpy(state[1]).unsqueeze(0).float().to(device))
    with torch.no_grad():
        outputs,self.dqn_hidden_state,self.dqn_cell_state =\
        self.dqn(state,self.dqn_hidden_state,self.dqn_cell_state)
        action = torch.argmax(outputs)
    return action.item()


mingyuliutw commented on July 26, 2024

FYI

We have a cleaner and better implementation of FUNIT in
https://github.com/NVlabs/imaginaire

Due to our limited resources, we will likely support Imaginaire better in the future.


seoyeon-p commented on July 26, 2024

I just did a target_value = target_value.detach(). Error is gone

Doesn't it make training take longer? The error is gone, but training is now much slower than before. Can you share your experience with that?


nocoolsandwich commented on July 26, 2024

I just did a target_value = target_value.detach(). Error is gone

detach() will cut your backward graph at that tensor (no gradients will flow through it).


ParnaChowdhury commented on July 26, 2024

facing the same issue


rsdorighello commented on July 26, 2024

I'm having the same issue but can't solve it: "one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64, 3, 3, 3]] is at version 3; expected version 2 instead."
I keep getting this error. Here's the code, can anyone help me?
I feel the problem is in the "divide_conv" function. I'm trying to replicate this code from an article.

def adjust_learning_rate(optimizer, lr):
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

def make_layers(cfg, batch_norm=False):
    layers = []
    extra_layers_encode = []
    in_channels = 3
    for v in cfg:
        if v == 'M':
            layers += [nn.MaxPool2d(kernel_size=2, stride=2)]
        else:
            conv2d = nn.Conv2d(in_channels, v, kernel_size=3, padding=1)
            layers += [conv2d, nn.ReLU()]
            extra_layers_encode += [nn.Linear(in_channels, 256), nn.ReLU()]
            in_channels = v
    layers = nn.ModuleList(layers)
    extra_layers_encode = nn.ModuleList(extra_layers_encode)
    return layers, extra_layers_encode
class pruning_model(nn.Module):
    def __init__(self, extra_layers_encode):
        super(pruning_model, self).__init__()
        self.extra_layers_encode = extra_layers_encode
        self.rnncell = nn.GRUCell(256, 4, bias=True)
        self._initialize_weights()
        self.relu = nn.ReLU(inplace=True)

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()

    def forward(self, x, h, ct):
    	encode = self.extra_layers_encode[2*ct]
    	x = encode(x)
    	return self.rnncell(x, h)
class vgg_RPN(nn.Module):
    def __init__(self, conv_layers,extra_layers_encode, num_classes=1000):
        super(vgg_RPN, self).__init__()
        self.conv_layers = conv_layers
        self.extra_layers_encode = extra_layers_encode
        self.classifier = nn.Sequential(
            nn.Linear(512, 512),
            nn.ReLU(inplace = True),
            nn.Dropout(p=0.5),
            nn.Linear(512, num_classes),
        )
        self.rnncell = nn.GRUCell(256, 4, bias=True)
        self._initialize_weights()
        # mode = 0: VGG baseline
        # mode = 1: random pruning
        # mode = 2: RNP training
        self.mode = 0
        self.group = []
        self.greedyP = 0.9

    def divide_conv(self):
    	for layer in self.conv_layers:
    		if isinstance(layer, nn.Conv2d):
    			weights = layer.weight.data
    			weights = weights.view(weights.size(0), weights.size(1), -1)
    			norm = torch.norm(weights,2,2).cpu().numpy()
    			norm = np.mean(norm, 1)
    			order = np.argsort(norm)
    			glen = int(order.shape[0]/4)
          #glen = round(glen)
    			print(glen, order.shape[0])
    			g0 = torch.from_numpy(np.sort(order[3*glen:]))
    			g1 = torch.from_numpy(np.sort(np.hstack((order[3*glen:], order[2*glen:3*glen]))))
    			g2 = torch.from_numpy(np.sort(np.hstack((order[3*glen:], order[2*glen:3*glen], order[glen:2*glen]))))
    			g3 = torch.from_numpy(np.sort(np.hstack((order[3*glen:], order[2*glen:3*glen], order[glen:2*glen], order[0:glen]))))
    			self.group += [g0, g1, g2, g3]

    def forward(self, x):
    	if self.mode == 0:
    		for layer in self.conv_layers:
    			#print layer
    			x = layer(x)
    		x = x.view(x.size(0), -1)
    		x = self.classifier(x)

    	if self.mode == 1:
    		ct = 0
    		bs = x.size(0)
    		for layer in self.conv_layers:
    			if isinstance(layer, nn.Conv2d) and ct > 0:
    				#choice = random.randint(0, 3)
    				#now_group = self.group[ct][choice]
    				#x = F.conv2d(x, layer.weight[former_group, now_group, :, :], layer.bias[now_group], kernel_size=3, padding=1)
    				x = layer(x)
    				mask = torch.zeros(x.size())
    				for i in range(bs):
    					choice = random.randint(0, 3)
    					now_group = self.group[ct][choice]
    					mask[i][now_group] = 1.0
    					#print mask.sum()
    				mask = Variable(mask, requires_grad=False).cuda()
    				x = mask*x
    				ct += 1
    			elif isinstance(layer, nn.Conv2d) and ct == 0:
    				x = layer(x)
    				ct += 1
    			else:
    				x = layer(x)
    		x = x.view(x.size(0), -1)
    		x = self.classifier(x)

    	if self.mode == 2:
    		y = []
    		ct = 0
    		bs = x.size(0)
    		former_state = Variable(torch.zeros(bs, 4)).cuda()
    		for layer in conv_layers:
    			if isinstance(layer, nn.Conv2d) and ct > 0:
    				#choice = random.randint(0, 3)
    				#now_group = self.group[ct][choice]
    				#x = F.conv2d(x, layer.weight[former_group, now_group, :, :], layer.bias[now_group], kernel_size=3, padding=1)
    				x_pool = x.mean(3).mean(2)
    				x = layer(x)
    				mask = torch.zeros(x.size())
    				h = pnet(x_pool, former_state, ct)
    				#x_input = self.extra_layers_encode[ct](x_pool)
    				#h = self.rnncell(x_input, former_state)
    				former_state = h
    				h_softmax_np = h.data.cpu().numpy()
    				choices = np.zeros((bs,), int)
    				for i in range(bs):
    					choice = np.argmax(h_softmax_np[i])
    					if random.random() > self.greedyP:
    						choice = random.randint(0, 3)
    					choices[i] = choice    						
    					now_group = self.group[ct][choice]
    					mask[i][now_group] = 1.0
    				mask = Variable(mask, requires_grad=False).cuda()
    				x = mask*x
    				y += [[h, choices]]
    				ct += 1

    			elif isinstance(layer, nn.Conv2d) and ct == 0:
    				x = layer(x)
    				ct += 1
    			else:
    				x = layer(x)
    		x = x.view(x.size(0), -1)
    		x = self.classifier(x)
    		return x, y
    	return x

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                m.weight.data.normal_(0, 0.01)
                m.bias.data.zero_()
        for m in self.conv_layers:
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
                if m.bias is not None:
                    m.bias.data.zero_()
cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M', 512, 512, 512, 'M', 512, 512, 512, 'M']

conv_layers, extra_layers_encode = make_layers(cfg)
net = vgg_RPN(conv_layers, extra_layers_encode, 100)
net.cuda()

torch.autograd.set_detect_anomaly(True)

pnet = pruning_model(extra_layers_encode)
pnet.cuda()

transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

trainset = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR100(root='./data', train=False, download=True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size=100, shuffle=False, num_workers=2)

criterion = nn.CrossEntropyLoss()
criterion_rl = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-5)
optimizer_rl = optim.Adam(pnet.parameters(), lr=0.0001, weight_decay=5e-5)
def train(epoch):
    print('\nEpoch: %d' % epoch)
    net.train()
    train_loss = 0
    raw_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets = inputs.cuda(), targets.cuda()
        optimizer.zero_grad()
        inputs, targets = Variable(inputs), Variable(targets)
        outputs, y  = net(inputs)
        loss = criterion(outputs, targets)
        raw_loss += loss.item()
        loss.backward(retain_graph=True)
        optimizer.step()

        for i in range(len(y)):
            optimizer_rl.zero_grad()
            action = y[i][1]
            print(action.shape)
            print(y[i][0].size())
            ind = Variable(torch.from_numpy(np.expand_dims(action,1)).cuda())
            state_action_values = y[i][0].gather(1,ind)
            if i < len(y) - 1:
                next_q = y[i+1][0].data.cpu().numpy()
                rtargets = -action*0.1 +  np.max(next_q, 1)
            else:
                rtargets = - action*0.1 - raw_loss

            rtargets = Variable(torch.from_numpy(rtargets.astype(np.float32)).cuda())
            loss_rl = criterion_rl(state_action_values, rtargets)
            loss_rl.backward(retain_graph=True)
            optimizer_rl.step()

        train_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total += targets.size(0)
        correct += predicted.eq(targets.data).cpu().sum()

       # progress_bar(batch_idx, len(trainloader), 'Loss: %.3f Row_loss: %.3f | Acc: %.3f%% (%d/%d)'
           # % (train_loss/(batch_idx+1), raw_loss/(batch_idx+1), 100.*correct/total, correct, total))
        print('Loss: %.3f | Acc: %.3f%% (%d/%d)'
            % (train_loss/(batch_idx+1), 100.*correct/total, correct, total))

def test(epoch):
    global best_acc
    net.eval()
    test_loss = 0
    correct = 0
    total = 0
    for batch_idx, (inputs, targets) in enumerate(testloader):
        inputs, targets = inputs.cuda(), targets.cuda()
        inputs, targets = Variable(inputs, volatile=True), Variable(targets)
        outputs, y = net(inputs)
        loss = criterion(outputs, targets)

        test_loss += loss.item()
        _, predicted = torch.max(outputs.data, 1)
        total += targets.size(0)
        correct += predicted.eq(targets.data).cpu().sum()

        #progress_bar(batch_idx, len(testloader), 'Loss: %.3f  | Acc: %.3f%% (%d/%d)'
            #% (test_loss/(batch_idx+1), 100.*correct/total, correct, total))
        print('Loss: %.3f | Acc: %.3f%% (%d/%d)'
            % (test_loss/(batch_idx+1), 100.*correct/total, correct, total))



younisahmad commented on July 26, 2024

LeakyReLU

How do I use clone() in this?

# Training Loop

# Lists to keep track of progress

G_losses = []
D_losses = []
E_losses = []
iters = 0
num_epochs = 2

print("Starting Training Loop...")

# For each epoch

for epoch in range(num_epochs):
# For each batch in the dataloader
for i, (images) in enumerate(dataloader, 0):
netG.train()
netD.train()
netE.train()

    netD.zero_grad()
    
    images = images.to(device)
    fake_images = netG(netE(images))
    ############################
    # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
    ###########################
    
    ## Create a fake pair batch --

    inp_x = {}
    inp_x['img']=images
    inp_x['encoded'] = netE(images)

label = torch.full((images.size(0),), real_label, device=device)

    label = torch.FloatTensor(np.random.uniform(low=0.855, high=0.999, size=(images.size(0)))).to(device)
    output = netD(inp_x).view(-1)
    errD_real = criterion(output, label.clone())
    errD_real.backward(retain_graph=True)
    D_x = output.mean().item()
    
    inp_x_fake = {}
    inp_x_fake['img']=fake_images
    inp_x_fake['encoded'] = netE(images)
    label = torch.FloatTensor(np.random.uniform(low=0.005, high=0.155, size=(images.size(0)))).to(device)

label.fill_(fake_label)

    output = netD(inp_x_fake).view(-1)
    errD_fake = criterion(output, label.clone())
    errD_fake.backward(retain_graph=True)
    D_G_z1 = output.mean().item()
    
    errD = errD_real + errD_fake
    
    optimizerD.step()

    ############################
    # (2) Update G network: maximize log(D(G(z)))
    ###########################
    netG.zero_grad()
    inp_x_fake = {}
    inp_x_fake['img']=fake_images
    inp_x_fake['encoded'] = netE(images)
    
    label = torch.FloatTensor(np.random.uniform(low=0.895, high=0.999, size=(images.size(0)))).to(device)

label.fill_(real_label)

    output = netD(inp_x_fake).view(-1)
    
    errG = criterion(output, label.clone()) + 2*l1criterion(images,fake_images)
    errG.backward(retain_graph=True)
    D_G_z2 = output.mean().item()
    optimizerG.step()

    
    netE.zero_grad()
    inp_x_fake = {}
    inp_x_fake['img']=fake_images
    inp_x_fake['encoded'] = netE(images)
    
    label = torch.FloatTensor(np.random.uniform(low=0.895, high=0.999, size=(images.size(0)))).to(device)
    output = netD(inp_x_fake).view(-1)

    errE = criterion(output, label.clone()) + 2*l1criterion(images,fake_images)
    errE.backward(retain_graph=True)
    E_G_z2 = output.mean().item()
    optimizerE.step()
    
    #################################_______STATS________###########################################
    # Output training stats
    if i % 50 == 0:
        print('[%d/%d][%d/%d]\tLoss_D: %.4f\tLoss_G: %.4f\tLoss_E: %.4f\tD(x): %.4f\tD(G(z)): %.4f / %.4f'
              % (epoch, num_epochs, i, len(dataloader),
                 errD.item(), errG.item(),errE.item(), D_x, D_G_z1, D_G_z2))

    # Save Losses for plotting later
    G_losses.append(errG.item())
    D_losses.append(errD.item())
    E_losses.append(errE.item())
    
    # Check how the generator is doing by saving G's output on fixed_noise

if (iters % 50 == 0) or ((epoch == num_epochs-1) and (i == len(dataloader)-1)):

netG.eval()

with torch.no_grad():

fake = netG(fixed_noise).detach().cpu()

fake[:] = fake[:]*0.5 + 0.5

img_list.append(vutils.make_grid(fake, padding=2, normalize=True))

    del images
    del inp_x_fake
    del inp_x
    del label
    del output
    torch.cuda.empty_cache()
    iters += 1
    
    
    
    if i%500 ==0:
        netE.eval()
        netG.eval()
        encoded_img = netE(valid_batch)
        reconstructed_img = netG(encoded_img)
        f, axarr = plt.subplots(num_images_to_show,2)
        for i in range(num_images_to_show):
            validimg = (valid_batch[i].cpu().detach().permute(1, 2, 0) * 0.5) + 0.5
            rec_img = (reconstructed_img[i].cpu().detach().permute(1, 2, 0) *0.5 ) + 0.5
            axarr[0].imshow(validimg)
            axarr[1].imshow(rec_img)
            f.set_figheight(20)
            f.set_figwidth(20)
        plt.show()


aalu-love commented on July 26, 2024

I'm getting the same error on loss.backward(); please help me with that specific line.

for hyperpam_test_id, (dropout, lr,batch_size, shuffle) in enumerate(product(*param_val)):
    print("Hyperparameter Test ID:", hyperpam_test_id + 1)
    model = Net().to(device)
    train_set = torch.utils.data.DataLoader(train, batch_size = batch_size, shuffle = shuffle)
    optimizer = optim.Adam(model.parameters(), lr= lr)
    criterion = torch.nn.CrossEntropyLoss()
    comment = f'dropout = {dropout} batch_size = {batch_size} lr = {lr} shuffle = {shuffle}'
    writer = SummaryWriter(comment=comment)
    for epoch in range(10):
        total_loss = 0
        total_correct = 0
        for images, labels in train_set:
            images, labels = images.to(device), labels.to(device)
            preds = model(images)

            loss = criterion(preds, labels)
            total_loss+= loss.item()
            total_correct+= get_num_correct(preds, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        writer.add_scalar("Loss", total_loss, epoch)
        writer.add_scalar("Correct", total_correct, epoch)
        writer.add_scalar("Accuracy", total_correct/ len(train_set), epoch)

        print("dropout:",dropout, "batch_size:",batch_size, "lr:",lr,"shuffle:",shuffle)
        print("epoch:", epoch, "total_correct:", total_correct, "loss:",total_loss)
    print("___________________________________________________________________")

    writer.add_hparams(
        {"dropout" : dropout, "lr": lr, "batch_size": batch_size, "shuffle":shuffle},
        {
            "accuracy": total_correct/ len(train_set),
            "loss": total_loss,
        },
    )

    writer.close()


LikeGiver commented on July 26, 2024

In my situation, this error was caused by the function load_state_dict. I rewrote it like
self.state_dict()["WEIGHTNAME"] = UPDATED_WEIGHT, and then the problem was solved.
It seems that the detach operation in load_state_dict caused this problem, but I'm not very sure.


LysSanzMoreta commented on July 26, 2024

I was trying to modify a tensor in place using a boolean mask:
out[mask] = 0

so I just transformed the mask to integers and did element-wise multiplication instead

out = out*mask.int()
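A small sketch of that idea; note that to zero exactly the entries selected by the mask out of place, the multiplication uses the complement of the boolean mask:

import torch

out = torch.randn(4, 4, requires_grad=True) * 2
mask = out > 0

# out[mask] = 0              # in-place masked assignment can trigger the error
out_zeroed = out * (~mask)   # out-of-place: zero the masked entries instead

out_zeroed.sum().backward()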


sammilei commented on July 26, 2024

My solution:

loss = loss.requires_grad_() # added this line because loss is not grad-able.
loss.backward()

CUDA Version: 11.2


simasoltanpour commented on July 26, 2024

I am having the same issue.

Traceback (most recent call last):
  File "model5.py", line 543, in <module>
    wgan.train()
  File "model5.py", line 272, in train
    loss = self._train_generator(free_img, noised_img)
  File "model5.py", line 346, in _train_generator
    g_loss.backward()
  File "/home/soltanps/py38-cc/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/soltanps/py38-cc/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 1, 6, 32, 32]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

Can anyone help me with that? I copied my code here.

#3DGAN

def train(self):
    self.dataset = MRIDataset(self.level)
    # self.dataset = MRIDataset10125()
    self.dataloader = DataLoader(
        self.dataset, batch_size=self.batch_size, shuffle=True)
    self.validDataset = MRIValidDataset(self.level)
    self.validDataloader = DataLoader(
        self.validDataset, batch_size=self.batch_size, shuffle=True)
    # Training times
    for epoch in range(0, self.epochs):
        self.test(epoch)
        # iterate dataset
        for batch_index, batch in enumerate(self.dataloader):
            # if (batch_index % 100 == 0):
            #     self.test(epoch)
            # print("epoch:", epoch, ";batch number:", batch_index, ";D_Loss:", end="")
            free_img = batch["free_img"]
            noised_img = batch["noised_img"]
            # print(type(noised_img))

            # training discriminator
            for iter_i in range(self.d_iter):
                loss = self._train_discriminator(free_img, noised_img)
                print("\tVGG_MSE - lr: %.10f, Level: %d, Epoch: %d, bath_index: %d, iter: %d, G-Loss: " %
                      (self.lr, self.level, epoch, batch_index, iter_i), loss)

            # training generator
            loss = self._train_generator(free_img, noised_img) #line272
            # print("G Loss:%.4f, %.4f" %
            #    (float(loss[0]), float(loss[1])))

            # Save the model and loss value
            if batch_index % 100 == 0:
                self.save_model()


        if ((epoch + 1) % 4 == 0 and self.lr > 1e-7):
            self.G_optimizer.defaults["lr"] *= 0.5
            self.G_optimizer.defaults["lr"] *= 0.5
            self.lr *= 0.5

def _train_discriminator(self, free_img, noised_img, train=True):
    self.D_optimizer.zero_grad()

    z = Variable(noised_img)
    real_img = Variable(free_img / 4096)
    if self.gpu:
        z = z.cuda()
        real_img = real_img.cuda()

    fake_img = self.generator(z)
    real_validity = self.discriminator(real_img)
    fake_validity = self.discriminator(fake_img.data / 4096)
    gradient_penalty = self._calc_gradient_penalty(
        real_img.data, fake_img.data)

    d_loss = torch.mean(-real_validity) + torch.mean(fake_validity) + \
        self.lambda_gp * gradient_penalty
    if train:
        d_loss.backward()
        # torch.mean(-real_validity).backward()
        # (torch.mean(-real_validity) + torch.mean(fake_validity)).backward()
        # torch.mean(-real_validity).backward()
        # torch.mean(fake_validity).backward()
        self.D_optimizer.step()

    return d_loss.data.item(), torch.mean(-real_validity).cpu().item(), torch.mean(fake_validity).cpu().item(), self.lambda_gp * gradient_penalty.cpu().item()

def _train_generator(self, free_img, noised_img, train=True):
    z = Variable(noised_img)
    real_img = Variable(free_img, requires_grad=False)


    if self.gpu:
        z = z.cuda()
        real_img = real_img.cuda()

    self.G_optimizer.zero_grad()
    self.D_optimizer.zero_grad()
    self.vgg19.zero_grad()

    criterion_mse = nn.MSELoss()
    criterion_vgg= nn.MSELoss()

    fake_img = self.generator(z)
    mse_loss = criterion_mse(fake_img, real_img)
    if train:
        (self.lambda_mse * mse_loss).backward(retain_graph=True)


    feature_fake_vgg = self.vgg19(fake_img)
    feature_real_vgg = Variable(self.vgg19(real_img).data, requires_grad=False).cuda()

    vgg_loss = criterion_vgg(feature_fake_vgg, feature_real_vgg)

    fake_validity = self.discriminator(fake_img / 4096)
    # g_loss = self.lambda_mse * mse_loss + self.lambda_vgg * vgg_loss + self.lambda_d * torch.mean(-fake_validity)
    g_loss =  self.lambda_vgg * vgg_loss + self.lambda_d * torch.mean(-fake_validity)            

    if train:
        # (self.lambda_mse * mse_loss).backward()
        g_loss.backward()  #line346 error
        self.G_optimizer.step()
    return g_loss.data.item(), mse_loss.data.item(), torch.mean(-fake_validity).data.item(), vgg_loss.data.item()


Choapinus commented on July 26, 2024

This error can be resolved by setting inplace=False in nn.ReLU and nn.LeakyReLU in blocks.py.

However, directly setting inplace=False will certainly decrease performance. We can check allocated memory using torch.cuda.memory_allocated(); in some cases, it could be almost double.

Another solution is to use clone(). If someone wants to operate on a tensor slice like x[0,1,:,:], a good choice is x[0,1,:,:].clone().

Any clearer description? Where should the clone operation be done for this issue?

Could you post your code here? It depends on your implementation.

I solved my problem like:

out = self.conv3(out)
out = self.norm3(out)
out = self.rgc({0: out, 1: x[1]})
if self.downsample is not None:
    identity = self.downsample(x[0])
out_x = out[0].clone() + identity
out_x = self.relu(out_x)
out_att = out[1]

I can't believe that clone() solved my problem haha, +1 and fav response <3


maxiuw commented on July 26, 2024

A similar thing happened to me -> the solution, as others mentioned, was (surprisingly) setting 2x more CPU memory in an sbatch file.

