Comments (3)
The fine-tuning setup on ScienceQA is similar to MM-CoT in terms of architecture and data organization. One major finding in MM-CoT is that the prediction order matters: reason first, then answer, which is why it is called "CoT". In our study, we find that the "CoT" claim is not very important. See the evidence in our paper:
Chain-of-thought. To decide the order between the answer and the reasoning process in the model prediction, we run both variants and observe that answer-first reports the best number, 89.77% accuracy, in 12 epochs, while reasoning-first can quickly reach 89.77% accuracy in 6 epochs but shows no further improvement with more training (89.96%). Training the model for 24 epochs does not improve the performance. We conclude that the CoT-like reasoning-first strategy can largely improve convergence speed, but contributes relatively little to the final performance.
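For readers who want to reproduce this ablation, the two prediction orders only differ in how the training target string is assembled. Below is a minimal sketch, not the authors' exact code; the function name and templates are illustrative:

```python
# Minimal sketch (illustrative templates, not the exact ones used in the paper):
# build the training target in either answer-first or reasoning-first order.
def build_target(answer: str, reasoning: str, answer_first: bool = True) -> str:
    """Concatenate the answer and the reasoning in the chosen order."""
    if answer_first:
        # answer-first: the variant that reached the best final accuracy above
        return f"The answer is {answer}. BECAUSE: {reasoning}"
    # reasoning-first: the CoT-style order, which converged faster above
    return f"{reasoning} So the answer is {answer}."

print(build_target("(B)", "Metal objects conduct electricity.", answer_first=True))
```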
Now, for both papers, most of the performance gain comes from the use of a vision encoder and end-to-end training on ScienceQA. This dataset is relatively small compared with VQA 2.0, so it is easy to reach high performance by training a large model on it. I hope this is noted so that readers can draw solid conclusions. Further, there are implementation differences between the two papers that lead us to different conclusions: (1) the choice of LLM; (2) we have a pre-training stage to connect the two modalities (sketched below), which leads to a 5% improvement compared with training from scratch, while MM-CoT does not have this pre-training stage. Hope it can be reconsidered in future development.
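As a rough illustration of point (2): the pre-training stage can be thought of as training only the projector that connects the frozen vision encoder to the LLM, before the end-to-end fine-tuning stage. This is a simplified sketch with illustrative class and attribute names, not the exact LLaVA implementation:

```python
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Toy container: a vision encoder, a linear projector, and an LLM."""
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a CLIP ViT
        self.projector = nn.Linear(vision_dim, llm_dim)  # connects the two modalities
        self.llm = llm                                   # e.g. a Vicuna/LLaMA model

def set_stage(model: VisionLanguageModel, stage: int) -> None:
    """Stage 1: train only the projector (alignment pre-training).
    Stage 2: fine-tune projector + LLM end-to-end (e.g. on ScienceQA)."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False            # vision encoder stays frozen in both stages
    for p in model.projector.parameters():
        p.requires_grad = True
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)     # LLM is only unfrozen in stage 2
```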
I'd like to clarify that ScienceQA helped us quantitatively ablate our design choices in the early stage of the project, but ScienceQA is not the single main focus of this project. We aim to help the community produce multimodal GPT-4-level capability with minimal effort: (1) a focus shift from model-centric to data-centric AI: the multimodal instruction-following data is the key, and most of our time was spent on it (an example record is shown below); (2) achieving multimodal chat with detailed description, such as OCR, and complex reasoning. The current demo has preliminary capabilities on this. Hope the community can be inspired to scale up this approach to reach better performance.
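For reference, each multimodal instruction-following sample pairs an image with a multi-turn conversation. The record below follows the JSON layout of the released LLaVA training data; the id, path, and text are made up for illustration:

```python
# One multimodal instruction-following record (values invented for illustration).
record = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this image?"},
        {"from": "gpt", "value": "A man is ironing clothes on a board attached to the roof of a moving taxi."},
    ],
}
```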
from llava.
The other important part is the synthetic GPT-4-based dataset.
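A minimal sketch of how such data can be synthesized with a text-only GPT-4: the model never sees the pixels; the image is represented by its captions and object bounding boxes, and GPT-4 is asked to write a conversation about it. The prompt wording and example values below are illustrative assumptions, not the exact pipeline used to build the released dataset:

```python
# Symbolic description of the image that a text-only GPT-4 can consume.
captions = ["A group of people standing outside of a black vehicle with various luggage."]
boxes = ["person: [0.68, 0.24, 0.77, 0.69]", "suitcase: [0.08, 0.63, 0.28, 0.98]"]

prompt = (
    "You are an AI visual assistant. Instead of the image itself, you are given "
    "its captions and object bounding boxes.\n"
    "Captions:\n" + "\n".join(captions) + "\n"
    "Boxes:\n" + "\n".join(boxes) + "\n"
    "Design a multi-turn conversation between a user asking about the image and "
    "an assistant that answers as if it were looking at the image."
)
# `prompt` would then be sent to GPT-4, and the reply parsed into
# {"from": "human"/"gpt", "value": ...} turns like the record shown earlier.
```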
from llava.
> In our study, we find that the "CoT" claim is not very important. See the evidence in our paper:
Wow, thanks for the work!
from llava.
Related Issues (20)
- device mis-match error on pre-training
- Deepspeed Assertion Error after training is completed while saving check points
- How to fine tune Lora without images……
- [Question] After merging, not able to infer from the model llava-mistral-v1.6-7b
- "Argo Tunnel error" when using demo via https://llava.hliu.cc/ => demo is not working
- [Usage] Different seeds are giving the exact same loss when running full finetuning with deepspeed Zero 1,2 or 3
- [Question] I got stuck here while doing fine-tuning training.
- [Question] For image_aspect_ratio, what is the difference between pad, square and anyres
- LLava always speaks of 2 images
- [Usage] After fine-tuning LLaVA 1.5, mm_projector.bin file is not available
- [Question]
- After SFT with this, my model's outputs start with <b>. Does anyone know why?
- [Question] The output of the llava-v1.6 model does not match my input prompt
- Some running questions about commit "Add NPU Support to LLaVA"[Usage]
- [Usage] how to load the trained lora?
- [Usage] Unfixed Random Seed and Get the Different Result in each Experiments
- Need either a `state_dict` or a `save_folder` containing offloaded weights.
- [Question] About The Paper Type
- LLaVA Colab notebook
- [Question] Imbalanced and Unstable GPU Usage during the training with Deepspeed