Finetuning question about recognize-anything (open)

adbmdp commented on July 17, 2024
Finetuning question


Comments (7)

xinyu1205 commented on July 17, 2024

It means you need to modify the forward function of ram.py or ram_plus.py.
And I strongly recommend that you read the RAM or RAM++ paper before attempting these tasks.


xinyu1205 commented on July 17, 2024

Thanks for your attention.
Actually, this is certainly feasible. The model's performance depends on the quality of your finetuning dataset.


adbmdp commented on July 17, 2024

Thanks for your reply and your awesome work @xinyu1205 !!

OK let's say I want to train the model with a celebrity dataset.

I have trouble understanding which tag file I need to update with the new tags.
To my understanding:
parse_label_id refers to the tag indices present in ram/data/tag_list.txt
union_label_id refers to the tag indices present in ram/data/ram_tag_list.txt

But when I look in the COCO dataset, for example, I find entries like:

{
  "image_path":"coco/val2014/COCO_val2014_000000522418.jpg",
  "parse_label_id":[
    [
      4480,
      4532,
      678
    ]
  ],
  "caption":[
    "there is a woman that is cutting a white cake"
  ],
  "union_label_id":[
    4480,
    2624,
    2051,
    678,
    2599,
    2577,
    4532,
    1238,
    215,
    2332,
    4439
  ]
}

For parse_label_id, I should look up the ID in ram/data/tag_list.txt, right?
That file only has 3429 IDs, and yet I see an ID of 4480!
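
A quick way to check which list an ID actually indexes into (a minimal sketch, assuming the IDs are zero-based line indices into these files):

import pathlib

# Load both tag lists and test which one each ID falls inside.
parse_tags = pathlib.Path("ram/data/tag_list.txt").read_text().splitlines()
union_tags = pathlib.Path("ram/data/ram_tag_list.txt").read_text().splitlines()
print(len(parse_tags), len(union_tags))

for tag_id in [4480, 4532, 678]:
    for name, tags in [("tag_list", parse_tags), ("ram_tag_list", union_tags)]:
        if tag_id < len(tags):
            print(f"{name}[{tag_id}] = {tags[tag_id]}")
        else:
            print(f"{name} has no index {tag_id}")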

So, to summarize: if I want to modify only the tagging part of RAM++, in which file should I add my tags (maybe just one)?
And my dataset can be something like:

{
  "image_path":"datasets/celebrities/CELEB_00001.jpg",
  "parse_label_id":[
    [
      9999
    ]
  ],
  "caption":[
    "Michael Jordan"
  ],
  "union_label_id":[
     8888
  ]
}


xinyu1205 commented on July 17, 2024

parse_label_id refers to the tags parsed from the image caption.
union_label_id refers to the full set of tags for the image.
Therefore, if you only have an image-tag dataset, you just need to set the image tags as union_label_id.
And you only need loss_tag and loss_dis in RAM or RAM++.
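
In code terms, that means turning the union_label_id list into a multi-hot target vector (a minimal sketch; ids_to_multihot is a hypothetical helper, and it assumes the tagging loss expects one multi-hot vector per image):

import torch

NUM_CLASSES = 4585  # length of ram/data/ram_tag_list.txt

def ids_to_multihot(union_label_id, num_classes=NUM_CLASSES):
    # Multi-hot target: 1.0 at each tag index present on the image.
    target = torch.zeros(num_classes)
    target[torch.tensor(union_label_id, dtype=torch.long)] = 1.0
    return target

# e.g. for the COCO sample quoted above:
image_tag = ids_to_multihot([4480, 2624, 2051, 678, 2599, 2577, 4532, 1238, 215, 2332, 4439])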


adbmdp commented on July 17, 2024

OK, so I just need:

{
  "image_path":"datasets/celebrities/CELEB_00001.jpg",
  "caption":[
    "Michael Jordan"
  ],
  "union_label_id":[
     new id from ram/data/ram_tag_list.txt
  ]
}
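
For instance, a minimal sketch of writing that annotation file (the tag ID 4585 is hypothetical, assuming the new tag is appended as the last line of ram_tag_list.txt and indices are zero-based):

import json

samples = [{
    "image_path": "datasets/celebrities/CELEB_00001.jpg",
    "caption": ["Michael Jordan"],
    "union_label_id": [4585],  # hypothetical: zero-based index of the appended tag
}]

with open("outputs/data.json", "w") as f:
    json.dump(samples, f, indent=2)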

And you only need loss_tag and loss_dis in RAM or RAM++.

I don't know what you mean here, but I'll try to find out. Do I have to change some code in finetune.py?

Thanks again for taking the time to reply 👍 🥇


adbmdp commented on July 17, 2024

Thanks. I'll do that.


adbmdp commented on July 17, 2024

So I'm trying to fine-tune the model on just one tag as a test (on my CPU).
I've added a new tag to recognize-anything/ram/data/ram_tag_list.txt, so now there are 4586 lines in this file.

I've modified the forward function:

def forward(self, image, caption, image_tag, clip_feature, batch_text_embed):
    image_embeds = self.image_proj(self.visual_encoder(image))
    image_atts = torch.ones(image_embeds.size()[:-1],
                            dtype=torch.long).to(image.device)

    ##================= Distillation from CLIP ================##
    image_cls_embeds = image_embeds[:, 0, :]
    image_spatial_embeds = image_embeds[:, 1:, :]

    loss_dis = F.l1_loss(image_cls_embeds, clip_feature)

    ###===========multi tag des reweight==============###
    bs = image_embeds.shape[0]

    # number of description embeddings stored per tag class
    des_per_class = int(self.label_embed.shape[0] / self.num_class)

    image_cls_embeds = image_cls_embeds / image_cls_embeds.norm(dim=-1, keepdim=True)
    reweight_scale = self.reweight_scale.exp()
    logits_per_image = (reweight_scale * image_cls_embeds @ self.label_embed.t())
    logits_per_image = logits_per_image.view(bs, -1, des_per_class)

    weight_normalized = F.softmax(logits_per_image, dim=2)
    label_embed_reweight = torch.empty(bs, self.num_class, 512).to(image.device).to(image.dtype)

    for i in range(bs):
        # weighted sum of each class's description embeddings
        reshaped_value = self.label_embed.view(-1, des_per_class, 512)
        product = weight_normalized[i].unsqueeze(-1) * reshaped_value
        label_embed_reweight[i] = product.sum(dim=1)

    label_embed = torch.nn.functional.relu(self.wordvec_proj(label_embed_reweight))

    ##================= Image Tagging ================##
    tagging_embed = self.tagging_head(
        encoder_embeds=label_embed,
        encoder_hidden_states=image_embeds,
        encoder_attention_mask=image_atts,
        return_dict=False,
        mode='tagging',
    )

    logits = self.fc(tagging_embed[0]).squeeze(-1)

    loss_tag = self.tagging_loss_function(logits, image_tag)

    # Skip the image-text alignment loss
    loss_alignment = None

    # Return only loss_tag and loss_dis
    return loss_tag, loss_dis
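
For context, a hypothetical training step that would consume this modified forward (the equal weighting of the two losses is an assumption, and model, optimizer, and the batch tensors are presumed already set up):

loss_tag, loss_dis = model(image, caption, image_tag, clip_feature, batch_text_embed)
loss = loss_tag + loss_dis  # assumed equal weighting of the two losses
optimizer.zero_grad()
loss.backward()
optimizer.step()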

Here is my finetune.yaml file:

train_file: [
            'outputs/data.json',
             ]
image_path_root: ""

# size of vit model; base or large
vit: 'swin_l'
vit_grad_ckpt: False
vit_ckpt_layer: 0

image_size: 384
batch_size: 26

# optimizer
weight_decay: 0.05
init_lr: 5e-06
min_lr: 0
max_epoch: 2
warmup_steps: 3000

class_num: 4586
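
A quick consistency check between this config and the tag list (a minimal sketch, assuming class_num must equal the number of lines in ram_tag_list.txt):

with open("ram/data/ram_tag_list.txt") as f:
    num_tags = sum(1 for _ in f)
assert num_tags == 4586, f"class_num (4586) != tag list length ({num_tags})"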

I launch the fine-tuning like this:
python3 finetune.py --model-type ram_plus --config ram/configs/finetune.yaml --checkpoint outputs/ram_plus/ram_plus_swin_large_14m.pth --output-dir outputs/ram_plus_ft --device cpu

RuntimeError: Error(s) in loading state_dict for RAM_plus:
	size mismatch for label_embed: copying a param with shape torch.Size([233835, 512]) from checkpoint, the shape in current model is torch.Size([233886, 512]).

I think the error message indicates a size mismatch between the pre-trained checkpoint's label_embed and the current model's label_embed. The difference is 233886 - 233835 = 51 rows, i.e. exactly one extra tag times the 51 description embeddings stored per class (233835 / 4585 = 51), so it comes from the tag I added. But I have no clue how to resolve this.
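
One possible way out (a minimal sketch, assuming the checkpoint stores weights under a "model" key; the zero rows are placeholders, and ideally the 51 new rows would be real CLIP text embeddings of the new tag's descriptions, generated the same way as the originals):

import torch

ckpt = torch.load("outputs/ram_plus/ram_plus_swin_large_14m.pth", map_location="cpu")
state = ckpt["model"] if "model" in ckpt else ckpt  # assumed checkpoint layout

old = state["label_embed"]              # [233835, 512] in the checkpoint
n_new = 233886 - old.shape[0]           # 51 rows for the one added tag
pad = torch.zeros(n_new, old.shape[1])  # placeholder embeddings for the new tag
state["label_embed"] = torch.cat([old, pad], dim=0)

# model: the RAM_plus instance built with class_num 4586
model.load_state_dict(state, strict=False)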

Thanks!

