zhoudaquan / dvit_repo

License: MIT License

Python 99.88% Shell 0.12%

dvit_repo's Issues

where is “distributed_train.sh” in the eval.sh in the project

Where is “distributed_train.sh”, which is referenced in eval.sh in the project, please?

extracting tar models

How should we extract the .tar pretrained models?
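One way to approach this, as a sketch only: PyTorch checkpoints named `.pth.tar` are often not tar archives at all but files written by `torch.save`, so it may help to check the file type before trying to unpack it. Whether that applies to this repo's files is an assumption, and `checkpoint_kind` is a hypothetical helper name, not part of the repo.

```python
import tarfile
import zipfile


def checkpoint_kind(path):
    """Guess how a checkpoint file was written (hypothetical helper)."""
    if tarfile.is_tarfile(path):
        # A real tar archive: unpack with tarfile.open(path).extractall(dest).
        return "tar-archive"
    if zipfile.is_zipfile(path):
        # torch.save has written zip-based files by default since PyTorch 1.6.
        return "torch-zip"
    # Anything else, e.g. an older pickle-based torch.save file.
    return "other"
```

If the result is not `"tar-archive"`, the file usually does not need extracting: `torch.load(path, map_location="cpu")` reads it directly. The weights may then sit under a key such as `"state_dict"`, which is also an assumption to verify by printing the loaded object's keys.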

Were the reported ViT results obtained with pre-training, as in Google's original paper?

cosine similarity of different attention maps

Hi!
I think calculating the similarity between different attention maps is a good way to explain the influence of transformer depth. Could you provide clean code for calculating the cosine similarity?
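Until the authors share their code, here is a minimal sketch of one plausible reading of the request: for a single head, compare the attention maps of two blocks row by row (each row being one query token's attention distribution) and average the cosine similarities. The choice of axis and the averaging are assumptions, not the paper's confirmed procedure, and the function names are hypothetical. It is written in plain Python so it runs without PyTorch; on tensors, `torch.nn.functional.cosine_similarity(a, b, dim=-1)` computes the same per-row quantity.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def attention_similarity(attn_p, attn_q):
    """Mean row-wise cosine similarity between two N x N attention maps.

    attn_p and attn_q are the maps of the same head taken from two
    different blocks, given as lists of rows.
    """
    sims = [cosine(row_p, row_q) for row_p, row_q in zip(attn_p, attn_q)]
    return sum(sims) / len(sims)
```

Averaging the result over heads and input images would give one number per pair of blocks.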

Hello, I noticed Fig. 5 in your paper.

Would you please tell me the meaning of the thick blue vertical line? Also, how did you arrive at this conclusion: "In the deep blocks, the MHSA learns nearly uniform global attention maps with high similarity."
Respect.

No benefit for DeiT-S

When I applied re-attention to DeiT-S (https://github.com/facebookresearch/deit), I observed no accuracy gain. Could you give some advice?

Attention map visualization

I noticed that you visualize the attention maps of selected blocks (in Fig. 6). Could you share the code for drawing them?

About the provided code and model for attention map visualization

https://drive.google.com/drive/folders/1_lxspG_nzPstxDWhKQqPWhYZlB6zPMGs?usp=sharing
Hi, Daquan! I tried the code and .pth.tar file you provided above. However, I got the output visualization for layer 1 like this.

The model key I used was "blocks.{layer_index}.attn.qkv.weight". Could you give me some advice on this? I appreciate it!

import torch.nn as nn


class ReAttention(nn.Module):
    """
    Attention maps are observed to be highly similar across samples in the same batch,
    so the batch dimension can be reduced when calculating the attention map.
    """
    def __init__(self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0., proj_drop=0.,
                 expansion_ratio=3, apply_transform=True, transform_scale=False):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.apply_transform = apply_transform
        
        # NOTE scale factor was wrong in my original version, can set manually to be compat with prev weights
        self.scale = qk_scale or head_dim ** -0.5
        if apply_transform:
            self.reatten_matrix = nn.Conv2d(self.num_heads,self.num_heads, 1, 1)
            self.var_norm = nn.BatchNorm2d(self.num_heads)
            self.qkv = nn.Linear(dim, dim * expansion_ratio, bias=qkv_bias)
            self.reatten_scale = self.scale if transform_scale else 1.0
        else:
            self.qkv = nn.Linear(dim, dim * expansion_ratio, bias=qkv_bias)
        
        self.attn_drop = nn.Dropout(attn_drop)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(proj_drop)

    def forward(self, x, atten=None):
        B, N, C = x.shape
        # x = self.fc(x)
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]   # make torchscript happy (cannot use tensor as tuple)

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        if self.apply_transform:
            attn = self.var_norm(self.reatten_matrix(attn)) * self.reatten_scale
        attn_next = attn
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x, attn_next
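A note that may help with the visualization question above: in `forward`, the attention map is `softmax(q @ k.transpose(-2, -1) * self.scale)`, so it depends on the input `x`, whereas `blocks.{layer_index}.attn.qkv.weight` is the parameter matrix of the `qkv` linear layer. The sketch below reproduces that computation for a single head in plain Python, without PyTorch and without the re-attention transform. The helper names `softmax_row` and `attention_map` are hypothetical.

```python
import math


def softmax_row(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(q, k, scale=None):
    """Single-head attention map softmax(q k^T * scale).

    q and k are N x d lists of per-token query and key vectors.
    The default scale is d ** -0.5, matching head_dim ** -0.5 in the class.
    """
    d = len(q[0])
    scale = d ** -0.5 if scale is None else scale
    return [
        softmax_row([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]
```

In the class above, the second return value of `forward` (`attn_next`) already carries the map for a given input, so collecting it per block during a forward pass is one way to obtain something to plot.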

What's the Norm function?

What's the Norm function in Eq.(3)? LayerNorm?

training script

Could you add the script / command you used for training?
