
The difference with BLIP-2? (minigpt-4, closed)

vision-cair commented on August 21, 2024:

The difference with BLIP-2?


Comments (6)

TsuTikgiau commented on August 21, 2024:

The main difference between MiniGPT-4 and BLIP-2 is the training strategy. We noticed that BLIP-2's training strategy is not enough to align the vision module well with a powerful LLM like Vicuna, and that it seriously hurts Vicuna's text generation ability. Therefore, we propose a new way to collect a small yet high-quality dataset of image-description pairs, created by the model itself and polished by ChatGPT. After a traditional image-text training stage like the one BLIP-2 uses, we further fine-tune MiniGPT-4 on this dataset together with conversation prompts, so that MiniGPT-4 can generate coherent text to answer users' questions, which greatly improves its usability. This fine-tuning stage is very efficient and can be finished in about 7 minutes on a single A100, yet its effect is significant.
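
To make the second stage concrete, here is a minimal sketch of how such conversation-style training examples could be assembled; the template string, instruction list, and function names are illustrative placeholders, not the exact prompts or code used by MiniGPT-4:

```python
import random

# Illustrative sketch (not the actual MiniGPT-4 code): each polished image
# description from the self-generated, ChatGPT-refined dataset is wrapped in
# a conversation-style prompt so the model learns to answer like a chat
# assistant rather than just emit captions.
INSTRUCTIONS = [
    "Describe this image in detail.",          # placeholder instructions
    "What do you see in this picture?",
]

def build_training_example(image_placeholder: str, polished_description: str) -> dict:
    """Pair a polished description with a randomly chosen conversation prompt."""
    prompt = (
        f"###Human: <Img>{image_placeholder}</Img> "
        f"{random.choice(INSTRUCTIONS)} ###Assistant: "
    )
    return {"prompt": prompt, "target": polished_description}

example = build_training_example("<ImageFeatureHere>", "A golden retriever lying on grass ...")
```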

Another important finding is that we do not need to fine-tune the Q-Former as BLIP-2 does; we directly use the Q-Former that was already aligned with FlanT5 and train only a single projection layer. We show that such a simple linear layer is enough to let Vicuna see the image, which makes our training very efficient.
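
As a rough illustration of that design (the module names and dimensions below are assumptions for the sketch, not the actual MiniGPT-4 code), the frozen vision encoder and frozen Q-Former produce query features, and a single trainable linear layer maps them into the LLM's embedding space:

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Illustrative sketch: freeze the vision encoder and the Q-Former
    (already aligned with FlanT5) and train only one linear projection
    into the LLM's embedding space."""

    def __init__(self, vision_encoder, q_former, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a frozen ViT
        self.q_former = q_former               # frozen, pre-aligned Q-Former
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.q_former.parameters():
            p.requires_grad = False
        # the only trainable parameters
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, images):
        with torch.no_grad():
            feats = self.vision_encoder(images)    # patch features
            queries = self.q_former(feats)         # (batch, num_queries, qformer_dim)
        return self.proj(queries)                  # (batch, num_queries, llm_dim)
```

The projected query tokens are then inserted into the language model's input embeddings alongside the text prompt, which is why only this one layer needs gradient updates.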


Pilot-LH commented on August 21, 2024:

Excellent job! I have some inquiries regarding the model:

  1. Based on my understanding, there are two stages in BLIP-2, and you are utilizing the pre-trained Q-Former from the second stage of BLIP-2 directly aligned with FlanT5 in this model. Please correct me if I am mistaken.
  2. It is intriguing to note that only a linear layer needs to be adjusted in BLIP-2, rather than Q-Former. When you mentioned "after the traditional image-text training stage like BLIP-2 did," were you referring to the first stage, second stage, or both stages of BLIP-2? As far as I know, the first stage of BLIP-2 is not traditional, and it is the key to the success of BLIP-2 (as shown in Figure 5 of the BLIP-2 paper).
  3. Fine-tuning on a small but high-quality dataset appears to be quite effective. Is it possible for BLIP-2 to benefit from this approach? I ask because Vicuna is not entirely open source.


TsuTikgiau commented on August 21, 2024:

@Pilot-LH Thanks for your interest!
A1. Yes, you are correct: we directly use the Q-Former aligned with FlanT5 XXL in our model.
A2. Here I mean the second stage of BLIP-2, as our first-stage pretraining is quite similar to BLIP-2's second-stage training. The difference is that we only train one linear layer.
A3. This is a good question. We haven't tried this, but I think the reason it works in our case is that Vicuna alone is already a close-to-ChatGPT-level model with strong conversation ability. The second-stage fine-tuning reactivates this ability when visual input is given, which is why the training is so light. In contrast, Flan-T5's conversation ability is weak, so Flan-T5 would first need to learn how to chat well with humans, and our small dataset is not enough to teach it that. I expect fully open-sourced LLMs that work like Vicuna will appear soon, since the way Vicuna is built is clear, and I think our training method can be applied directly once such an LLM is ready.


vateye commented on August 21, 2024:

Can you provide more samples for Stage-1 training to verify that Stage-2 is needed?


TsuTikgiau commented on August 21, 2024:

We plan to update our paper in two days to provide some qualitative and quantitative comparisons between stage-1 and stage-2. Stay tuned!


Pilot-LH commented on August 21, 2024:

Thank you for your response. I now have a clear understanding of the model.
I agree with your point that this approach can be applied to other large language models (LLMs).
In my opinion, one of the major challenges for the open-source community is to reproduce LLaMA. Once this is accomplished, there will likely be far more advanced models available than the current Vicuna.
