
InternLM-XComposer

InternLM-XComposer 🤗 🤖🐼   | InternLM-XComposer-VL 🤗 🤖🐼   | Technical Report 📄

English | 简体中文

Thanks to the community for the HuggingFace Demo and Replicate Demo

👋 join us on Discord and WeChat


Multimodal Projects of Our Team

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

ShareGPT4V: Improving Large Multi-modal Models with Better Captions


InternLM-XComposer is a vision-language large model (VLLM) based on InternLM for advanced text-image comprehension and composition. InternLM-XComposer has several appealing properties:

  • Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. The interleaved text-image composition is implemented in the following steps:

    1. Text Generation: It crafts long-form text based on human-provided instructions.
    2. Image Spotting and Captioning: It pinpoints optimal locations for image placement and furnishes image descriptions.
    3. Image Retrieval and Selection: It selects image candidates and identifies the image that best complements the content. (A sketch of this pipeline follows the list below.)
  • Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on extensive multi-modal multilingual concepts with carefully crafted strategies, resulting in a deep understanding of visual content.

  • Strong Performance: It consistently achieves state-of-the-art results across various benchmarks for vision-language large models, including MME Benchmark (English), MMBench (English), Seed-Bench (English), MMBench-CN (Chinese), and CCBench (Chinese).
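
The three composition steps above map naturally onto a loop over spotted image locations. The sketch below is illustrative only; every helper name in it is hypothetical and not part of the released API:

# Illustrative pseudocode for the interleaved composition pipeline.
# All helpers (write_article, spot_image_locations, retrieve_candidates,
# select_best_image) are hypothetical names, not the repo's API.
def compose_interleaved_article(instruction, image_database):
    # 1. Text generation: draft the long-form article from the instruction
    article = write_article(instruction)
    # 2. Image spotting and captioning: pick anchor points and describe
    #    the image that should appear at each one
    for location, caption in spot_image_locations(article):
        # 3. Image retrieval and selection: fetch candidates by caption,
        #    then keep the one that best complements the surrounding text
        candidates = retrieve_candidates(caption, image_database)
        best_image = select_best_image(caption, candidates, context=article)
        article.insert_image(location, best_image)
    return article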

We release InternLM-XComposer series in two versions:

  • InternLM-XComposer-VL-7B 🤗 🤖 : The pretrained and multi-task trained VLLM with InternLM as the initialization of the LLM, achieving strong performance on various multimodal benchmarks, e.g., MME Benchmark, MMBench, Seed-Bench, CCBench, and MMBench-CN.
  • InternLM-XComposer-7B 🤗 🤖 : The further instruction-tuned VLLM for Interleaved Text-Image Composition and use as an LLM-based AI assistant.

Please refer to the Technical Report for more details.

Demo

demo.mp4

Please refer to the Chinese Demo for the Chinese version.

News and Updates

  • 2023.11.22 🎉🎉🎉 We release ShareGPT4V, a large-scale, highly descriptive image-text dataset generated by GPT4-Vision, and a superior large multimodal model, ShareGPT4V-7B.
  • 2023.10.30 🎉🎉🎉 InternLM-XComposer-VL ranked first on both Q-Bench and Tiny LVLM.
  • 2023.10.19 🎉🎉🎉 Support for inference on multiple GPUs. Two 4090 GPUs are sufficient for deploying our demo.
  • 2023.10.12 🎉🎉🎉 The 4-bit demo is supported; model files are available on Hugging Face and ModelScope.
  • 2023.10.8 🎉🎉🎉 InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on ModelScope.
  • 2023.9.27 🎉🎉🎉 The evaluation code of InternLM-XComposer-VL-7B is publicly available.
  • 2023.9.27 🎉🎉🎉 InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on Hugging Face.
  • 2023.9.27 🎉🎉🎉 We release the technical report with more details of our model series.

Evaluation

We evaluate InternLM-XComposer-VL on seven multimodal benchmarks: MME Benchmark, MMBench, Seed-Bench, Q-Bench, and Tiny LVLM in English, plus CCBench and MMBench-CN in Simplified Chinese.

  • MME Benchmark: A comprehensive evaluation benchmark for multimodal large language models with 14 subtasks.
  • MMBench: A comprehensive evaluation pipeline comprising a meticulously curated multimodal dataset and a novel CircularEval strategy using ChatGPT.
  • MMBench-CN: A Simplified Chinese version of MMBench.
  • Seed-Bench: A multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs.
  • CCBench: A multimodal benchmark for Chinese cultural comprehension.
  • Q-Bench: A benchmark for general-purpose foundation models on low-level vision.
  • Tiny LVLM: An ability-level multimodal dataset split derived from the LVLM-eHub.

InternLM-XComposer-VL outperforms existing vision-language large models on all seven benchmarks, demonstrating stronger multilingual comprehension ability.

MME Benchmark

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

InternLM-XComposer-VL achieves SOTA overall performance. See more details HERE.

Overall Performance

Rank  Model                  Version         Score
1     InternLM-XComposer-VL  InternLM-7B     1919.5
2     Qwen-VL-Chat           Qwen-7B         1848.3
3     MMICL                  FlanT5xxl       1810.7
4     Skywork-MM             Skywork-MM-13B  1775.5
5     BLIVA                  FlanT5xxl       1669.2

MMBench & MMBench-CN

MMBench is a comprehensive evaluation pipeline comprising a meticulously curated multimodal dataset and a novel CircularEval strategy using ChatGPT. It covers 20 ability dimensions defined by MMBench. MMBench-CN is the Chinese-language version of MMBench.

InternLM-XComposer-VL achieves SOTAs on the test splits of both MMBench and MMBench-CN. See more details HERE.

MMBench Test Split

Rank  Model                  Version      Score
1     InternLM-XComposer-VL  InternLM-7B  74.4
2     Pink                   Vicuna-7B    74.1
3     JiuTian                FLANT5-XXL   71.8
4     WeMM                   InternLM-7B  69.0
5     mPLUG-Owl              LLaMA2 7B    68.5

MMBench-CN Test Split

Rank  Model                  Version      Score
1     InternLM-XComposer-VL  InternLM-7B  72.4
2     QWen-VL-Chat           Qwen-7B      56.3
3     LLaVA                  LLaMA 7B     36.6
4     VisualGLM              ChatGLM 6B   25.6
5     mPLUG-Owl              LLaMA2 7B    24.9

SEED-Bench

SEED-Bench is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions that include both image and video understanding. See more details HERE.

InternLM-XComposer-VL achieves SOTA on the image part of this benchmark.

SEED-Bench Image Evaluation

Rank  Model                  Version      Score
1     InternLM-XComposer-VL  InternLM-7B  66.9
2     QWen-VL-Chat           Qwen-7B      65.4
3     QWen-VL                Qwen-7B      62.3
4     InstructBLIP-Vicuna    Vicuna 7B    58.8
5     InstructBLIP           Flan-T5-XL   57.8

CCBench

CCBench is a multimodal benchmark for Chinese cultural comprehension. See more details HERE.

CCBench Performance

Rank  Model                  Version      Score
1     InternLM-XComposer-VL  InternLM-7B  47.6
2     QWen-VL-Chat           Qwen-7B      39.3
3     mPLUG-Owl              LLaMA2 7B    12.9
4     InstructBLIP           Vicuna 7B    12.1
5     VisualGLM              ChatGLM 6B   9.2

Q-Bench

Q-Bench is a benchmark for general-purpose foundation models on low-level vision.

Q-Bench Performance

Rank  A1: Perception (dev)            A1: Perception (test)           A2: Description                 A3: Assessment
1     InternLM-XComposer-VL (0.6535)  InternLM-XComposer-VL (0.6435)  InternLM-XComposer-VL (4.21/6)  InternLM-XComposer-VL (0.542, 0.581)
2     LLaVA-v1.5-13B (0.6214)         InstructBLIP-T5-XL (0.6194)     Kosmos-2 (4.03/6)               Qwen-VL (0.475, 0.506)
3     InstructBLIP-T5-XL (0.6147)     Qwen-VL (0.6167)                mPLUG-Owl (3.94/6)              LLaVA-v1.5-13B (0.444, 0.473)

Tiny LVLM

Tiny LVLM is an ability-level multimodal dataset split derived from the LVLM-eHub.

Tiny LVLM Performance

Rank  Model                  Version       Score
1     InternLM-XComposer-VL  InternLM-7B   322.51
2     Bard                   Bard          319.59
3     Qwen-VL-Chat           Qwen-VL-Chat  316.81

Requirements

  • Python 3.8 and above
  • PyTorch 1.12 and above; 2.0 and above is recommended
  • CUDA 11.4 and above is recommended (for GPU users)

Installation

Before running the code, make sure you have set up the environment and installed the required packages: confirm you meet the above requirements, then install the dependent libraries. Please refer to the installation instructions.
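
A typical setup looks like the following; the requirements.txt file name is an assumption here, so follow the linked installation instructions if they differ:

git clone https://github.com/InternLM/InternLM-XComposer.git
cd InternLM-XComposer
pip install -r requirements.txt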

Quickstart

We provide a simple example to show how to use InternLM-XComposer with 🤗 Transformers.

🤗 Transformers
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True)
model.tokenizer = tokenizer

# example image
image = 'examples/images/aiyinsitan.jpg'

# Single-Turn Pure-Text Dialogue
text = 'Please introduce Einstein.'
response = model.generate(text)
print(response)
# Albert Einstein was a German-born theoretical physicist who developed the general theory of relativity, one of the 
# two pillars of modern physics (alongside quantum mechanics). He is best known for his mass–energy equivalence 
# formula E = mc2 (which has been dubbed "the world's most famous equation"), and his explanation of the photoelectric 
# effect, both of which are examples of his special and general theories of relativity. Einstein is widely regarded as 
# one of the most influential physicists of all time.


# Single-Turn Text-Image Dialogue
text = 'Please introduce the person in this picture in detail.'
image = 'examples/images/aiyinsitan.jpg'
response = model.generate(text, image)
print(response)
# The person in the picture is Albert Einstein, a renowned theoretical physicist and one of the most influential 
# scientists of the 20th century. He is depicted in a black and white portrait, wearing a suit and tie, and has a 
# serious expression on his face.


# Multi-Turn Text-Image Dialogue
# 1st turn
text = 'Who is in the picture?'
response, history = model.chat(text=text, image=image, history=None)
print(response)
# Albert Einstein is in the picture.

# 2nd turn
text = 'What are his achievements?'
response, history = model.chat(text=text, image=None, history=history)
print(response)
# Albert Einstein was a German-born theoretical physicist who developed the general theory of relativity, 
# one of the two pillars of modern physics (alongside quantum mechanics). He is best known for his mass–energy 
# equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), and his explanation of 
# the photoelectric effect, both of which are examples of his special and general theories of relativity.

# 3rd turn
text = 'Is he the greatest physicist?'
response, history = model.chat(text=text, image=None, history=history)
print(response)
# Yes, Albert Einstein is widely regarded as one of the greatest physicists of all time.
🤖 ModelScope
import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm-xcomposer-7b')
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model.tokenizer = tokenizer

# example image
image = 'examples/images/aiyinsitan.jpg'

# Single-Turn Pure-Text Dialogue
text = 'Please introduce Einstein.'
response = model.generate(text)
print(response)
# Albert Einstein was a German-born theoretical physicist who developed the general theory of relativity, one of the 
# two pillars of modern physics (alongside quantum mechanics). He is best known for his mass–energy equivalence 
# formula E = mc2 (which has been dubbed "the world's most famous equation"), and his explanation of the photoelectric 
# effect, both of which are examples of his special and general theories of relativity. Einstein is widely regarded as 
# one of the most influential physicists of all time.

Web UI

Thanks to the community for the 3rd-party HuggingFace Demo and Replicate Demo

We provide code for users to build a web UI demo.

Please run the command below (GPU memory >= 32GB recommended):

python examples/web_demo.py

The user guide for the UI demo is given HERE. If you wish to change the default folder of the model, please use the --folder=new_folder option.
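
For example, to serve weights from a local directory (the path below is illustrative):

python examples/web_demo.py --folder=/path/to/internlm-xcomposer-7b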

Quantization

We provide 4-bit quantized models to ease the memory requirements of the models. To run the 4-bit models (GPU memory >= 12GB), you first need to install the corresponding dependency, then execute the following scripts for chat and the web demo:

# 4-bit chat
python examples/example_chat_4bit.py
# 4-bit web demo
python examples/web_demo_4bit.py
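
A small helper (not part of the repo) for checking which demo fits on your GPU; the thresholds follow the memory numbers quoted in this README:

import torch

# free/total device memory in bytes for the current CUDA device
free_bytes, _ = torch.cuda.mem_get_info()
free_gb = free_bytes / 1024**3
if free_gb >= 32:
    print('enough memory for the full demo: python examples/web_demo.py')
elif free_gb >= 12:
    print('use the 4-bit demo: python examples/web_demo_4bit.py')
else:
    print('consider multi-GPU inference, see the next section')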

Inference on Multiple GPUs

If you have multiple GPUs, but the memory size of each GPU is not enough to accommodate the entire model, you can split the model across multiple GPUs. First, install accelerate using the command pip install accelerate. Then, execute the following scripts for chat and the web demo:

# chat with 2 GPUs
python examples/example_chat.py --num_gpus 2
# web demo with 2 GPUs
python examples/web_demo.py --num_gpus 2
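
The --num_gpus flag is the documented path. If you load the model in your own script instead, a minimal sketch using accelerate's automatic layer placement follows; whether the remote-code model fully supports device_map is an assumption, so prefer the flag above if it does not:

import torch
from transformers import AutoModel, AutoTokenizer

# device_map='auto' asks accelerate to shard the layers across all visible GPUs
# (assumption: the remote code tolerates sharded placement)
model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer-7b',
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map='auto',
).eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True)
model.tokenizer = tokenizer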

Calculate TFLOPs and Params

Install the required package: pip install calflops

# text = 'Please introduce the person in this picture in detail.'
# image = 'examples/images/aiyinsitan.jpg'
python examples/example_params_and_flops.py

The expected output is FLOPs: 17.6 TFLOPS, Params: 8.8 B.
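
For reference, a minimal sketch of measuring a text-only forward pass with calflops is below. The calculate_flops call with a transformers tokenizer is calflops' standard entry point, but whether it handles this remote-code model's multimodal inputs is an assumption; the bundled script above is the authoritative measurement.

from calflops import calculate_flops
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True)

# calflops builds a dummy text batch from the tokenizer and counts one forward pass
flops, macs, params = calculate_flops(
    model=model,
    input_shape=(1, 128),  # batch size 1, 128 text tokens
    transformer_tokenizer=tokenizer,
)
print(flops, params)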

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@misc{zhang2023internlmxcomposer,
      title={InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition}, 
      author={Pan Zhang and Xiaoyi Dong and Bin Wang and Yuhang Cao and Chao Xu and Linke Ouyang and Zhiyuan Zhao and Shuangrui Ding and Songyang Zhang and Haodong Duan and Wenwei Zhang and Hang Yan and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
      year={2023},
      eprint={2309.15112},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License & Contact Us

The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English)/申请表(中文). For other questions or collaborations, please contact [email protected].
