bradyfu / awesome-multimodal-large-language-models
:sparkles::sparkles: Latest Advances on Multimodal Large Language Models
As titled.
I re-evaluated the performance of MiniGPT-4 with Vicuna-13B on MME using the MMBench code base. I got scores of 580.5 for perception and 144.29 for cognition, which are very close to the results on the official leaderboard. A system prompt, as in the official MiniGPT-4 code, was set up.
However, in the original paper, I notice huge performance gaps with reference to Figure 2: the perception score is 866.58 and the cognition score is 292.14, ranking first.
I wonder where this difference comes from and which result should be taken as the correct evaluation.
Hi,
We have a new accepted paper that might be relevant to multimodal in-context learning and LLM-aided visual reasoning.
ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction
Thanks!
Hey,
We have two papers that might be relevant to this repo:
Best
Hi, thanks for compiling this list! I hope to bring the following works from my team to your attention:
Very nice and impressive work! Your WeChat ID seems to be frequently restricted from being added. Is there any other way to join the WeChat group? Thanks!
Hi, I would like to know: if a model randomly outputs yes or no for every question, what performance will it get?
https://arxiv.org/abs/2307.02499
We propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities on instruction compliance and document understanding. Experimental results show that our model outperforms existing multimodal models, demonstrating its strong document understanding ability. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. We open-source our code at this https URL and provide an interactive demo.
Thanks for this awesome repo!
Our work (Polite Flamingo) has some updated links.
Demo link: http://clever_flamingo.xiaoice.com/
Dataset link: https://huggingface.co/datasets/chendelong/PF-1M/tree/main
We suggest modifying the dataset note of PF-1M to:
A collection of 37 vision-language datasets with responses rewritten by Polite Flamingo.
Thanks a lot~
Thanks for organizing such a useful repo for multimodal papers. I am an author of the MultiInstruct paper. We recently open-sourced all the datasets and instructions used in our multimodal instruction tuning paper: MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. After submission, we expanded the number of tasks to 62.
Here is the GitHub link: https://github.com/VT-NLP/MultiInstruct
Could you please help me update the GitHub link in your repo?
BTW, we plan to release an additional 150 diverse vision-language tasks next month.
We are honored to evaluate the Qwen-VL series on your great work, the MME Benchmark.
Qwen-VL-Chat achieves the current SOTA on MME. We provide all code and steps HERE to reproduce the results.
We would appreciate it if you could update these changes on your home page and figures as soon as possible.
=========== Perception ===========
total score: 1487.58
existence score: 158.33
count score: 150.00
position score: 128.33
color score: 170.00
posters score: 178.57
celebrity score: 120.59
scene score: 152.25
landmark score: 164.00
artwork score: 125.50
OCR score: 140.00
=========== Cognition ===========
total score: 360.71
commonsense_reasoning score: 130.71
numerical_calculation score: 40.00
text_translation score: 147.50
code_reasoning score: 42.50
Hello,
This is a very nice benchmark for evaluating MLLMs.
Could you help us evaluate our released LMEye variant, shown at https://github.com/YunxinLi/LingCloud/tree/main/LMEye?
Thanks.
Hi, thanks for the great work! BLIP-2's FlanT5-XXL uses bfloat16, while the V100 does not support bfloat16. As shown in your paper, all your experiments were done using V100 GPUs. I also use V100s in my lab. Are there any methods to run BLIP-2 FlanT5 on the V100 GPU?
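Not an official answer, just a sketch. One workaround worth trying, assuming the Hugging Face transformers port of BLIP-2 (which may differ from the code base used for the paper's numbers), is to load the model in float32 instead of bfloat16; float16 is another option but can be numerically unstable for FlanT5:

# Sketch only: pick a dtype the V100 supports instead of bfloat16.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-flan-t5-xxl"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float32,  # V100 has no bfloat16; fp16 may overflow for FlanT5
    device_map="auto",          # offloads layers to CPU if the weights do not fit
)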
Hi,
LLaVA-1.5 and MiniGPT-v2 were released recently.
Does the leaderboard reflect the latest updates?
Thanks,
Hello, thank you for the wonderful work and the dataset!
I was trying to download the landmark images by running MME_Benchmark_release_version/landmark/images/download_landmark.py, but only ~35 images were successfully downloaded, while the others failed with the error: Failed to download the URL file.
Are there alternative sources for downloading these images, or did I do something incorrectly?
Thanks for the help in advance :)
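Not a maintainer answer, just a guess: many of the landmark URLs point at upload.wikimedia.org, which can reject requests that carry a default or empty User-Agent, so retrying the failed URLs with an explicit header sometimes helps. A minimal sketch (the helper and header string below are hypothetical, not part of the official download_landmark.py):

# Hypothetical retry helper for failed landmark URLs.
import requests

def download(url: str, out_path: str) -> bool:
    # An explicit User-Agent avoids some Wikimedia-side rejections of default clients.
    headers = {"User-Agent": "MME-landmark-downloader/0.1 (research use)"}
    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 200:
        with open(out_path, "wb") as f:
            f.write(resp.content)
        return True
    return False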
The GitHub link to ICL-D3IE is now accessible. If possible, could you add it to the readme?
Furthermore, we have a new paper about Multimodal Chain-of-Thought. Would it be possible to add this paper to your paper list? The paper's name is T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering. (https://arxiv.org/abs/2305.03453)
Thank you.
Do you have plans to support Chinese evaluation of MLLMs in the MME benchmark?
As titled.
https://arxiv.org/abs/2306.16410
We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at this https URL and provide an interactive demo.
I hope you can add Pink to the list. https://github.com/SY-Xuan/Pink.
Best.
It seems like the link is not available.
I want to learn how to create a Python radar chart like yours.
I wonder whether you could share the code. Thanks!
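Not the authors' plotting code, but a minimal matplotlib sketch of how such a radar chart can be drawn; the subtask names and scores below are only illustrative:

# Minimal radar (spider) chart sketch with matplotlib; values are illustrative only.
import numpy as np
import matplotlib.pyplot as plt

labels = ["existence", "count", "position", "color", "OCR", "scene"]
scores = [158.3, 150.0, 128.3, 170.0, 140.0, 152.3]

angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
scores = scores + scores[:1]   # close the polygon
angles = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, scores, linewidth=2)
ax.fill(angles, scores, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 200)            # MME subtask scores range from 0 to 200
plt.show()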
Great work! Can you evaluate LLaVA-13B based on Vicuna-13B v1.1? Thanks!
May I ask whether this refers to AI poster creation, or something else related to posters?
Thanks for the amazing survey. We have a related work and hope you could consider including it.
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn (https://arxiv.org/abs/2306.08640)
Project Page: https://assistgpt-project.github.io/
Can you help me add our paper HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models in the evaluation section? Thanks!
Hi, thanks for curating this awesome list! There is a recent paper that evaluates several MLLMs (InstructBLIP, BLIP-2, etc.) on more challenging human exam questions requiring images:
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
arXiv: https://arxiv.org/pdf/2306.05179.pdf
Github: https://github.com/DAMO-NLP-SG/M3Exam
Maybe you can consider adding this, since it is closely related to this repo. Thanks!
I feel confused about those two branches, since it seems the evaluation branch is out of sync with the main branch...
Could anyone confirm which branch I should refer to if I want to commit?
Mindstorms in Natural Language-Based Societies of Mind from KAUST AI Initiative
Such an exciting project list of multimodal LLMs. We have a related work that we hope can be added to this awesome repository.
Paper Title: LMEye: An Interactive Perception Network for Large Language Models
What is the classification basis for the category of "Foundation Models"? For example, why are Flamingo and mPLUG-Owl not foundation models?
I want to know why so many models achieve scores lower than 75 in Fig. 2, given that the random accuracies of the two metrics are 50% and 25%. Didn't they follow the instruction? Didn't they answer yes/no?
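For reference, here is a rough back-of-the-envelope check (my own sketch, not the official scorer): under MME's acc (per-question) and acc+ (both questions of an image correct) metrics, a model answering yes/no uniformly at random lands around 50% and 25%, i.e. a combined subtask score of about 75.

# Simulate a random yes/no answerer under MME-style acc and acc+ metrics.
# Assumes each image has two questions, one with ground truth "yes" and one "no".
import random

random.seed(0)
n_images = 100_000
correct_q, correct_img = 0, 0
for _ in range(n_images):
    guesses = [random.choice(["yes", "no"]) for _ in range(2)]
    hits = [guesses[0] == "yes", guesses[1] == "no"]
    correct_q += sum(hits)
    correct_img += all(hits)

acc = correct_q / (2 * n_images)        # ~0.50
acc_plus = correct_img / n_images       # ~0.25
print(f"acc={acc:.3f}  acc+={acc_plus:.3f}  subtask score~{100 * (acc + acc_plus):.1f}")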
Hi, thanks for this awesome repo!
The GitHub link for SVIT is ready: https://github.com/BAAI-DCAI/Visual-Instruction-Tuning.
If convenient, could you help to update it to the list?
Thank you!
As titled.
I found that 03d5e3bfc958be38.jpg could not be downloaded. Could you fix the link or upload the image? The link is:
https://upload.wikimedia.org/wikipedia/commons/a/a2/Pietrarsa_railway_museum_67.JPG
Hi, Awesome-Multimodal-Large-Language-Models is a nice repo with a great hierarchical structure. Below are some suggestions that might be helpful.
Some references are crucial but neglected in the research track of Awesome-Multimodal-Large-Language-Models, such as VL-T5, FrozenBiLM, VL-Adapter, and LST. Also, feel free to diff the current repo against my research trends and fill in anything that is missing.
For the survey, there exists another concern that needs careful consideration: Is it worthwhile for our research community to follow tool-oriented technical reports? Although it may be difficult to determine until the result of the next top conference, I believe you could handle this matter properly.
By the way, is there any way to participate in the ongoing MLLM survey? Thank you very much for your time.
The performance of GIT2 on the leaderboard is quite impressive. It only has 5.1B parameters. The original paper was published in 2022, and the repository has not been updated since March 2023. The original GIT and GIT2 models did not use techniques like instruction fine-tuning, yet GIT2 still beats many state-of-the-art models as of August 2023.
Does the performance come from a newer closed-source variant from Microsoft, an open-source version, or the original GIT2 from 2022?
Thanks for the curated list of multimodal LLM. We have a related work that we hope is added to this awesome repository.
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation (https://arxiv.org/pdf/2303.05983.pdf)
Project Page: https://matrix-alpha.github.io/
We propose Shikra, an MLLM designed to kick off referential dialogue, excelling at spatial coordinate inputs/outputs in natural language without additional vocabularies, position encoders, pre-/post-detection, or external plug-in models.
arXiv:https://arxiv.org/abs/2306.15195
code: https://github.com/shikras/shikra
May I know where I can download the images such as tt0074749.jpg
to evaluate our model? I can see some of them are in COCO format but some are not. What should I do if I wish to submit our own results?
Hi, this is the author of LAMM. We have released the data, benchmark, and model for LAMM. Please update the link in your list. Thank you for your support :)
Github: https://github.com/OpenLAMM/LAMM
Hi! Shikra's online demo is ready.
If possible, please update it to the list :)
http://demo.zhaozhang.net:7860/
Thank you!
Hello, I first would like to say thank you for this great repository.
I evaluated mPLUG-Owl using their official prompt, as follows:
The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: {question}
AI:
But I got a score of about 1100, which is worse than the reported number in the MME paper (about 1250).
What is the exact prompt you used for evaluation?
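For what it's worth, here is my own reconstruction of how the prompt above gets assembled per question; the exact template behind the paper's numbers is precisely what this issue is asking about, so treat this only as an illustration:

# Hypothetical per-question prompt assembly for the mPLUG-Owl template quoted above.
PROMPT_TEMPLATE = (
    "The following is a conversation between a curious human and AI assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    "Human: <image>\n"
    "Human: {question}\n"
    "AI:"
)

def build_prompt(question: str) -> str:
    return PROMPT_TEMPLATE.format(question=question)

print(build_prompt("Is there a dog in the image? Please answer yes or no."))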
We have a related work that we hope is added to this awesome repository. Thanks a lot.
Paper Title: Explainable Multimodal Emotion Reasoning
Paper link: https://arxiv.org/pdf/2306.15401.pdf
Project: https://github.com/zeroQiaoba/Explainable-Multimodal-Emotion-Reasoning