Comments (1)
After some more investigation, not explicitly creating dW seems to work.
The LoRA forward pass is modified to accept the low-rank matrices directly. Then dW
is never materialized and GPU memory utilization goes down to 2000 MB.
def forward_lora(self, x, dWa, dWb):
    B, S, E = x.shape
    # calculate frozen model output
    qkv = self.qkv(x).reshape(B, 3, S, E).permute(1, 0, 2, 3)
    # calculate LoRA adaptation: (x @ dWb.T) @ dWa.T avoids ever forming dW = dWa @ dWb
    dqkv = (x @ dWb.T @ dWa.T).reshape(B, 3, S, E).permute(1, 0, 2, 3)
    # add as in equation (3) in the paper
    qkv = qkv + dqkv
    q, k, v = qkv.unbind(0)
    attn = q @ k.transpose(-2, -1)
    x = attn @ v
    return x
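For reference, a minimal usage sketch under the same setup (the frozen module exposing self.qkv, embed, and seq are assumed from the original post; the name model and the sizes below are placeholders): only dWa and dWb carry gradients, and the full 3*embed-by-embed update matrix is never materialized.

import torch
from math import sqrt

embed, seq, rank = 768, 197, 2   # placeholder sizes, not the values from the original post
device = 'cuda:0'

# the low-rank factors are the only trainable tensors
dWa = torch.nn.Parameter((torch.normal(0, 1, (3 * embed, rank)) / sqrt(rank)).to(device))
dWb = torch.nn.Parameter((torch.normal(0, 1, (rank, embed)) / sqrt(rank)).to(device))

x = torch.ones((1, seq, embed), device=device)
out = model.forward_lora(x, dWa, dWb)   # model: frozen module exposing self.qkv (placeholder name)
torch.sum(out).backward()               # gradients flow only into dWa and dWb
input('check nvidia-smi for GPU utilization')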
I also tried to do this using PyTorch parametrization, but this seems to copy the original weight tensor:
import torch
import torch.nn as nn
import torch.nn.utils.parametrize as parametrize
from math import sqrt

class LoRA_Layer_Mod(nn.Module):
    def __init__(self, embed_dim, rank, device):
        super().__init__()
        self.embed_dim = embed_dim
        # move the tensors before wrapping, so dWa/dWb stay registered as Parameters
        self.dWa = nn.Parameter((torch.normal(0, 1, (3 * embed_dim, rank)) / sqrt(rank)).to(device))
        self.dWb = nn.Parameter((torch.normal(0, 1, (rank, embed_dim)) / sqrt(rank)).to(device))

    def forward(self, qkv):
        # the full (3*embed_dim, embed_dim) update is materialized here
        return qkv + self.dWa @ self.dWb

basemodel = BaseModel(embed).to('cuda:0')
basemodel.requires_grad_(False)
x = torch.ones((1, seq, embed)).to('cuda:0')

lora_layer = LoRA_Layer_Mod(embed_dim=embed, rank=2, device='cuda:0')
parametrize.register_parametrization(basemodel.qkv, "weight", lora_layer)

out = basemodel(x)
out = torch.sum(out)
out.backward()
input('check nvidia-smi for GPU utilization')
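As a sanity check of where the extra memory goes (a small sketch against the code above, not part of the original post): after register_parametrization, the frozen tensor is kept under qkv.parametrizations.weight.original, and each access to qkv.weight re-runs LoRA_Layer_Mod.forward (no caching by default), so the full 3*embed_dim-by-embed_dim matrix dWa @ dWb is materialized in addition to the original weight.

# the frozen weight is stored separately as .original
print(basemodel.qkv.parametrizations.weight.original.shape)
# accessing .weight recomputes qkv + dWa @ dWb, materializing the full-size update
print(basemodel.qkv.weight.shape)
print(parametrize.is_parametrized(basemodel.qkv, "weight"))   # True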
The conclusion seems to be: do not create the AB matrix explicitly, and do not use PyTorch parametrization for this.