🤖 Multi-modal GPT

Train a multi-modal chatbot with visual and language instructions!

Based on the open-source multi-modal model OpenFlamingo, we build visual instruction data from open datasets covering VQA, Image Captioning, Visual Reasoning, Text OCR, and Visual Dialogue. We also train the language model component of OpenFlamingo on language-only instruction data.

The joint training of visual and language instructions effectively improves the performance of the model! For more details please refer to our technical report.

Welcome to join us!

Features

  • Support various vision and language instruction data
  • Parameter efficient fine-tuning with LoRA
  • Joint tuning of vision and language, so that the two complement each other
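
As a rough illustration of why LoRA is parameter efficient: for a weight matrix of shape d × k, LoRA keeps the original weights frozen and trains two low-rank factors of shapes d × r and r × k. The sketch below only counts trainable parameters. The dimensions and rank are illustrative values chosen for this example, not the settings in configs/lora_config.py.

```python
from typing import Tuple


def lora_param_counts(d: int, k: int, r: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k        # full fine-tuning updates every entry of W
    lora = r * (d + k)  # LoRA trains only B (d x r) and A (r x k)
    return full, lora


# Illustrative sizes only (assumed, not read from the project's config).
full, lora = lora_param_counts(d=4096, k=4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")
```

The smaller the rank r is relative to d and k, the fewer parameters need gradients and optimizer state.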

Installation

To install the package in an existing environment, run:

git clone https://github.com/open-mmlab/Multimodal-GPT.git
cd Multimodal-GPT
pip install -r requirements.txt
pip install -v -e .

Alternatively, create a new conda environment:

conda env create -f environment.yml

Launch Demo Locally

  1. Download the pre-trained weights.

    Use this script for converting LLaMA weights to Hugging Face format.

    Download the OpenFlamingo pre-trained model from openflamingo/OpenFlamingo-9B.

    Download our LoRA weights from here.

    Then place these models in the checkpoints folder like this:

    checkpoints
    ├── llama-7b_hf
    │   ├── config.json
    │   ├── pytorch_model-00001-of-00002.bin
    │   ├── ......
    │   └── tokenizer.model
    ├── OpenFlamingo-9B
    │   └── checkpoint.pt
    └── mmgpt-lora-v0-release.pt

  2. Launch the Gradio demo:

    python app.py

Examples

Recipe:

image4

Travel plan:

image3

Movie:

image2

Famous person:

image

Fine-tuning

Prepare datasets

  1. A-OKVQA

    Download the annotations from this link and unzip them to data/aokvqa/annotations.

    It also requires images from the COCO dataset, which can be downloaded from here.

  2. COCO Caption

    Download from this link and unzip to data/coco.

    It also requires images from the COCO dataset, which can be downloaded from here.

  3. OCR VQA

    Download from this link and place in data/OCR_VQA/.

  4. LLaVA

    Download from liuhaotian/LLaVA-Instruct-150K and place in data/llava/.

    It also requires images from the COCO dataset, which can be downloaded from here.

  5. Mini-GPT4

    Download from Vision-CAIR/cc_sbu_align and place in data/cc_sbu_align/.

  6. Dolly 15k

    Download from databricks/databricks-dolly-15k and place it in data/dolly/databricks-dolly-15k.jsonl.

  7. Alpaca GPT4

    Download it from this link and place it in data/alpaca_gpt4/alpaca_gpt4_data.json.

  8. Baize

    Download it from this link and place it in data/baize/quora_chat_data.json.

You can also customize the data paths in configs/dataset_config.py.

Start training

torchrun --nproc_per_node=8 mmgpt/train/instruction_finetune.py \
  --lm_path checkpoints/llama-7b_hf \
  --tokenizer_path checkpoints/llama-7b_hf \
  --pretrained_path checkpoints/OpenFlamingo-9B/checkpoint.pt \
  --run_name train-my-gpt4 \
  --learning_rate 1e-5 \
  --lr_scheduler cosine \
  --batch_size 1 \
  --tuning_config configs/lora_config.py \
  --dataset_config configs/dataset_config.py \
  --report_to_wandb
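
The command above pairs --learning_rate 1e-5 with --lr_scheduler cosine. As a rough picture of what a cosine schedule does, the sketch below decays the learning rate from its peak to zero over the run, with an optional linear warmup. The warmup length, the total step count, and the decay-to-zero floor are assumptions made for illustration; the scheduler used by mmgpt/train/instruction_finetune.py may differ in these details.

```python
import math


def cosine_lr(step, total_steps, peak_lr=1e-5, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr (assumed warmup shape).
        return peak_lr * step / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical 1000-step run: print the rate at a few points.
for s in (0, 250, 500, 750, 1000):
    print(s, cosine_lr(s, total_steps=1000))
```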

Acknowledgements

If you find our project useful for your research and applications, please cite using this BibTeX:

@misc{gong2023multimodalgpt,
      title={MultiModal-GPT: A Vision and Language Model for Dialogue with Humans}, 
      author={Tao Gong and Chengqi Lyu and Shilong Zhang and Yudong Wang and Miao Zheng and Qian Zhao and Kuikun Liu and Wenwei Zhang and Ping Luo and Kai Chen},
      year={2023},
      eprint={2305.04790},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
