
InternLM-XComposer

InternLM-XComposer 🤗 🤖🐼   | InternLM-XComposer-VL 🤗 🤖🐼   | Technical Report 📄

English | 简体中文

Thanks to the community for the HuggingFace Demo and Replicate Demo

👋 join us on Discord and WeChat


Multimodal Projects of Our Team

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

ShareGPT4V: Improving Large Multi-modal Models with Better Captions


InternLM-XComposer is a vision-language large model (VLLM) based on InternLM for advanced text-image comprehension and composition. InternLM-XComposer has several appealing properties:

  • Interleaved Text-Image Composition: InternLM-XComposer can effortlessly generate coherent and contextual articles that seamlessly integrate images, providing a more engaging and immersive reading experience. The interleaved text-image composition is implemented in the following steps:

    1. Text Generation: It crafts long-form text based on human-provided instructions.
    2. Image Spotting and Captioning: It pinpoints optimal locations for image placement and furnishes image descriptions.
    3. Image Retrieval and Selection: It selects image candidates and identifies the image that best complements the content. (A sketch of this pipeline follows the list below.)
  • Comprehension with Rich Multilingual Knowledge: The text-image comprehension is empowered by training on extensive multi-modal multilingual concepts with carefully crafted strategies, resulting in a deep understanding of visual content.

  • Strong Performance: It consistently achieves state-of-the-art results across various benchmarks for vision-language large models, including MME Benchmark (English), MMBench (English), Seed-Bench (English), MMBench-CN (Chinese), and CCBench (Chinese).
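
The three composition steps above map naturally onto a loop over spotted image locations. The sketch below is illustrative only; every helper name in it is hypothetical and not part of the released API:

# Illustrative pseudocode for the interleaved composition pipeline.
# All helpers (write_article, spot_image_locations, retrieve_candidates,
# select_best_image) are hypothetical names, not the repo's API.
def compose_interleaved_article(instruction, image_database):
    # 1. Text generation: draft the long-form article from the instruction
    article = write_article(instruction)
    # 2. Image spotting and captioning: pick anchor points and describe
    #    the image that should appear at each one
    for location, caption in spot_image_locations(article):
        # 3. Image retrieval and selection: fetch candidates by caption,
        #    then keep the one that best complements the surrounding text
        candidates = retrieve_candidates(caption, image_database)
        best_image = select_best_image(caption, candidates, context=article)
        article.insert_image(location, best_image)
    return article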

We release InternLM-XComposer series in two versions:

  • InternLM-XComposer-VL-7B 🤗 🤖 : The pretrained and multi-task trained VLLM with InternLM as the initialization of the LLM, achieving strong performance on various multimodal benchmarks, e.g., MME Benchmark, MMBench, Seed-Bench, CCBench, and MMBench-CN.
  • InternLM-XComposer-7B 🤗 🤖 : The further instruction-tuned VLLM for Interleaved Text-Image Composition and use as an LLM-based AI assistant.

Please refer to the Technical Report for more details.

Demo

demo.mp4

Please refer to the Chinese Demo for the Chinese version.

News and Updates

  • 2023.11.22 🎉🎉🎉 We release ShareGPT4V, a large-scale, highly descriptive image-text dataset generated by GPT4-Vision, and a superior large multimodal model, ShareGPT4V-7B.
  • 2023.10.30 🎉🎉🎉 InternLM-XComposer-VL ranked first on both Q-Bench and Tiny LVLM.
  • 2023.10.19 🎉🎉🎉 Support for inference on multiple GPUs. Two 4090 GPUs are sufficient for deploying our demo.
  • 2023.10.12 🎉🎉🎉 The 4-bit demo is supported; model files are available on Hugging Face and ModelScope.
  • 2023.10.8 🎉🎉🎉 InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on ModelScope.
  • 2023.9.27 🎉🎉🎉 The evaluation code of InternLM-XComposer-VL-7B is publicly available.
  • 2023.9.27 🎉🎉🎉 InternLM-XComposer-7B and InternLM-XComposer-VL-7B are publicly available on Hugging Face.
  • 2023.9.27 🎉🎉🎉 We release the technical report with more details of our model series.

Evaluation

We evaluate InternLM-XComposer-VL on seven multimodal benchmarks: MME Benchmark, MMBench, Seed-Bench, Q-Bench, and Tiny LVLM in English, plus CCBench and MMBench-CN in Simplified Chinese.

  • MME Benchmark: A comprehensive evaluation benchmark for multimodal large language models with 14 subtasks.
  • MMBench: A comprehensive evaluation pipeline comprising a meticulously curated multimodal dataset and a novel CircularEval strategy using ChatGPT.
  • MMBench-CN: A Simplified Chinese version of MMBench.
  • Seed-Bench: A multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs.
  • CCBench: A multimodal benchmark for Chinese cultural comprehension.
  • Q-Bench: A benchmark for general-purpose foundation models on low-level vision.
  • Tiny LVLM: An ability-level multimodal dataset split derived from the LVLM-eHub.

InternLM-XComposer-VL outperforms existing vision-language large models on all seven benchmarks, demonstrating stronger multilingual comprehension ability.

MME Benchmark

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.

InternLM-XComposer-VL achieves SOTA overall performance. See more details HERE.

Overall Performance

Rank  Model                  Version         Score
1     InternLM-XComposer-VL  InternLM-7B     1919.5
2     Qwen-VL-Chat           Qwen-7B         1848.3
3     MMICL                  FlanT5xxl       1810.7
4     Skywork-MM             Skywork-MM-13B  1775.5
5     BLIVA                  FlanT5xxl       1669.2

MMBench & MMBench-CN

MMBench is a comprehensive evaluation pipeline comprising a meticulously curated multimodal dataset and a novel CircularEval strategy using ChatGPT. It covers 20 ability dimensions defined by MMBench. MMBench-CN is the Chinese-language version of MMBench.

InternLM-XComposer-VL achieves SOTAs on the test splits of both MMBench and MMBench-CN. See more details HERE.

MMBench Test Split

Rank  Model                  Version      Score
1     InternLM-XComposer-VL  InternLM-7B  74.4
2     Pink                   Vicuna-7B    74.1
3     JiuTian                FLANT5-XXL   71.8
4     WeMM                   InternLM-7B  69.0
5     mPLUG-Owl              LLaMA2 7B    68.5

MMBench-CN Test Split

Rank  Model                  Version      Score
1     InternLM-XComposer-VL  InternLM-7B  72.4
2     QWen-VL-Chat           Qwen-7B      56.3
3     LLaVA                  LLaMA 7B     36.6
4     VisualGLM              ChatGLM 6B   25.6
5     mPLUG-Owl              LLaMA2 7B    24.9

SEED-Bench

SEED-Bench is a multimodal benchmark of 19K multiple-choice questions with accurate human annotations for evaluating Multimodal LLMs, covering 12 evaluation dimensions that include both image and video understanding. See more details HERE.

InternLM-XComposer-VL achieves SOTA on the image part of this benchmark.

SEED-Bench Image Evaluation

Rank  Model                  Version      Score
1     InternLM-XComposer-VL  InternLM-7B  66.9
2     QWen-VL-Chat           Qwen-7B      65.4
3     QWen-VL                Qwen-7B      62.3
4     InstructBLIP-Vicuna    Vicuna 7B    58.8
5     InstructBLIP           Flan-T5-XL   57.8

CCBench

CCBench is a multimodal benchmark for Chinese cultural comprehension. See more details HERE.

CCBench Performance

Rank  Model                  Version      Score
1     InternLM-XComposer-VL  InternLM-7B  47.6
2     QWen-VL-Chat           Qwen-7B      39.3
3     mPLUG-Owl              LLaMA2 7B    12.9
4     InstructBLIP           Vicuna 7B    12.1
5     VisualGLM              ChatGLM 6B   9.2

Q-Bench

Q-Bench is a benchmark for general-purpose foundation models on low-level vision.

Q-Bench Performance

Rank  A1: Perception (dev)            A1: Perception (test)           A2: Description                 A3: Assessment
1     InternLM-XComposer-VL (0.6535)  InternLM-XComposer-VL (0.6435)  InternLM-XComposer-VL (4.21/6)  InternLM-XComposer-VL (0.542, 0.581)
2     LLaVA-v1.5-13B (0.6214)         InstructBLIP-T5-XL (0.6194)     Kosmos-2 (4.03/6)               Qwen-VL (0.475, 0.506)
3     InstructBLIP-T5-XL (0.6147)     Qwen-VL (0.6167)                mPLUG-Owl (3.94/6)              LLaVA-v1.5-13B (0.444, 0.473)

Tiny LVLM

Tiny LVLM is an ability-level multimodal dataset split derived from the LVLM-eHub.

Tiny LVLM Performance

Rank  Model                  Version       Score
1     InternLM-XComposer-VL  InternLM-7B   322.51
2     Bard                   Bard          319.59
3     Qwen-VL-Chat           Qwen-VL-Chat  316.81

Requirements

  • Python 3.8 and above
  • PyTorch 1.12 and above; 2.0 and above is recommended
  • CUDA 11.4 and above is recommended (for GPU users)

Installation

Before running the code, make sure you have set up the environment and installed the required packages: confirm you meet the above requirements, then install the dependent libraries. Please refer to the installation instructions.
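
A typical setup looks like the following; the requirements.txt file name is an assumption here, so follow the linked installation instructions if they differ:

git clone https://github.com/InternLM/InternLM-XComposer.git
cd InternLM-XComposer
pip install -r requirements.txt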

Quickstart

We provide a simple example to show how to use InternLM-XComposer with 🤗 Transformers.

🤗 Transformers
import torch
from transformers import AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model = AutoModel.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True)
model.tokenizer = tokenizer

# example image
image = 'examples/images/aiyinsitan.jpg'

# Single-Turn Pure-Text Dialogue
text = 'Please introduce Einstein.'
response = model.generate(text)
print(response)
# Albert Einstein was a German-born theoretical physicist who developed the general theory of relativity, one of the 
# two pillars of modern physics (alongside quantum mechanics). He is best known for his mass–energy equivalence 
# formula E = mc2 (which has been dubbed "the world's most famous equation"), and his explanation of the photoelectric 
# effect, both of which are examples of his special and general theories of relativity. Einstein is widely regarded as 
# one of the most influential physicists of all time.


# Single-Turn Text-Image Dialogue
text = 'Please introduce the person in this picture in detail.'
image = 'examples/images/aiyinsitan.jpg'
response = model.generate(text, image)
print(response)
# The person in the picture is Albert Einstein, a renowned theoretical physicist and one of the most influential 
# scientists of the 20th century. He is depicted in a black and white portrait, wearing a suit and tie, and has a 
# serious expression on his face.


# Multi-Turn Text-Image Dialogue
# 1st turn
text = 'Who is in the picture?'
response, history = model.chat(text=text, image=image, history=None)
print(response)
# Albert Einstein is in the picture.

# 2nd turn
text = 'What are his achievements?'
response, history = model.chat(text=text, image=None, history=history)
print(response)
# Albert Einstein was a German-born theoretical physicist who developed the general theory of relativity, 
# one of the two pillars of modern physics (alongside quantum mechanics). He is best known for his mass–energy 
# equivalence formula E = mc2 (which has been dubbed "the world's most famous equation"), and his explanation of 
# the photoelectric effect, both of which are examples of his special and general theories of relativity.

# 3rd turn
text = 'Is he the greatest physicist?'
response, history = model.chat(text=text, image=None, history=history)
print(response)
# Yes, Albert Einstein is widely regarded as one of the greatest physicists of all time.
🤖 ModelScope
import torch
from modelscope import snapshot_download, AutoModel, AutoTokenizer

torch.set_grad_enabled(False)

# init model and tokenizer
model_dir = snapshot_download('Shanghai_AI_Laboratory/internlm-xcomposer-7b')
model = AutoModel.from_pretrained(model_dir, trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model.tokenizer = tokenizer

# example image
image = 'examples/images/aiyinsitan.jpg'

# Single-Turn Pure-Text Dialogue
text = 'Please introduce Einstein.'
response = model.generate(text)
print(response)
# Albert Einstein was a German-born theoretical physicist who developed the general theory of relativity, one of the 
# two pillars of modern physics (alongside quantum mechanics). He is best known for his mass–energy equivalence 
# formula E = mc2 (which has been dubbed "the world's most famous equation"), and his explanation of the photoelectric 
# effect, both of which are examples of his special and general theories of relativity. Einstein is widely regarded as 
# one of the most influential physicists of all time.

Web UI

Thanks to the community for the 3rd-party HuggingFace Demo and Replicate Demo

We provide code for users to build a web UI demo.

Please run the command below (GPU memory >= 32GB recommended):

python examples/web_demo.py

The user guide for the UI demo is given HERE. If you wish to change the default folder of the model, please use the --folder=new_folder option.
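
For example, to serve weights from a local directory (the path below is illustrative):

python examples/web_demo.py --folder=/path/to/internlm-xcomposer-7b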

Quantization

We provide 4-bit quantized models to ease the memory requirements of the models. To run the 4-bit models (GPU memory >= 12GB), you first need to install the corresponding dependency, then execute the following scripts for chat and the web demo:

# 4-bit chat
python examples/example_chat_4bit.py
# 4-bit web demo
python examples/web_demo_4bit.py
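
A small helper (not part of the repo) for checking which demo fits on your GPU; the thresholds follow the memory numbers quoted in this README:

import torch

# free/total device memory in bytes for the current CUDA device
free_bytes, _ = torch.cuda.mem_get_info()
free_gb = free_bytes / 1024**3
if free_gb >= 32:
    print('enough memory for the full demo: python examples/web_demo.py')
elif free_gb >= 12:
    print('use the 4-bit demo: python examples/web_demo_4bit.py')
else:
    print('consider multi-GPU inference, see the next section')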

Inference on Multiple GPUs

If you have multiple GPUs, but the memory size of each GPU is not enough to accommodate the entire model, you can split the model across multiple GPUs. First, install accelerate using the command pip install accelerate. Then, execute the following scripts for chat and the web demo:

# chat with 2 GPUs
python examples/example_chat.py --num_gpus 2
# web demo with 2 GPUs
python examples/web_demo.py --num_gpus 2
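
The --num_gpus flag is the documented path. If you load the model in your own script instead, a minimal sketch using accelerate's automatic layer placement follows; whether the remote-code model fully supports device_map is an assumption, so prefer the flag above if it does not:

import torch
from transformers import AutoModel, AutoTokenizer

# device_map='auto' asks accelerate to shard the layers across all visible GPUs
# (assumption: the remote code tolerates sharded placement)
model = AutoModel.from_pretrained(
    'internlm/internlm-xcomposer-7b',
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map='auto',
).eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True)
model.tokenizer = tokenizer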

Calculate TFLOPs and Params

Install the required package: pip install calflops

# text = 'Please introduce the person in this picture in detail.'
# image = 'examples/images/aiyinsitan.jpg'
python examples/example_params_and_flops.py

The expected output is FLOPs: 17.6 TFLOPS, Params: 8.8 B.
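
For reference, a minimal sketch of measuring a text-only forward pass with calflops is below. The calculate_flops call with a transformers tokenizer is calflops' standard entry point, but whether it handles this remote-code model's multimodal inputs is an assumption; the bundled script above is the authoritative measurement.

from calflops import calculate_flops
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained('internlm/internlm-xcomposer-7b', trust_remote_code=True)

# calflops builds a dummy text batch from the tokenizer and counts one forward pass
flops, macs, params = calculate_flops(
    model=model,
    input_shape=(1, 128),  # batch size 1, 128 text tokens
    transformer_tokenizer=tokenizer,
)
print(flops, params)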

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@misc{zhang2023internlmxcomposer,
      title={InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition}, 
      author={Pan Zhang and Xiaoyi Dong and Bin Wang and Yuhang Cao and Chao Xu and Linke Ouyang and Zhiyuan Zhao and Shuangrui Ding and Songyang Zhang and Haodong Duan and Wenwei Zhang and Hang Yan and Xinyue Zhang and Wei Li and Jingwen Li and Kai Chen and Conghui He and Xingcheng Zhang and Yu Qiao and Dahua Lin and Jiaqi Wang},
      year={2023},
      eprint={2309.15112},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License & Contact Us

The code is licensed under Apache-2.0, while model weights are fully open for academic research and also allow free commercial usage. To apply for a commercial license, please fill in the application form (English)/申请表(中文). For other questions or collaborations, please contact [email protected].
