
CoLLaVO: Crayon Large Language and Vision mOdel (ACL 2024 Findings) [arxiv]

📰 News

[Demo GIF: crayon_demo]

🎨 In-Progress

  • Code is public (only inference is supported).
  • CoLLaVO-7B is available for download on Hugging Face.
  • Hugging Face README.md with simple run instructions.
  • Short example code for running on a single image is available.
  • GPT-aided evaluation (to be uploaded).

Official PyTorch implementation code for realizing the technical part of Crayon Large Language and Vision mOdel (CoLLaVO), which improves performance on numerous zero-shot vision language tasks. This code is developed on two baseline codebases: XDecoder: Generalized Decoding for Pixel, Image, and Language (CVPR 2023) and InternLM (technical paper).

๐Ÿ๏ธ Summary

The remarkable success of Large Language Models (LLMs) and instruction tuning is driving the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities, probed by questions such as 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose the Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt, a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present Dual QLoRA, a learning strategy that preserves object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap on numerous VL benchmarks in a zero-shot setting.
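
To make the Crayon Prompt idea concrete, below is a minimal sketch of how per-class 'crayon' embeddings derived from a panoptic map could be injected into image patch features. This is illustrative only, not the authors' implementation: the module name CrayonPrompt, the tensor shapes, and the additive injection are all assumptions.

import torch
import torch.nn as nn

class CrayonPrompt(nn.Module):
    """Hypothetical sketch: one learnable 'crayon color' embedding per
    panoptic class, added to the patch features it covers."""

    def __init__(self, num_classes: int, embed_dim: int):
        super().__init__()
        self.color_embed = nn.Embedding(num_classes, embed_dim)

    def forward(self, patch_feats: torch.Tensor, panoptic_map: torch.Tensor) -> torch.Tensor:
        # patch_feats:  (B, N, D) image patch features from the vision encoder
        # panoptic_map: (B, N)    dominant panoptic class index per patch
        crayon = self.color_embed(panoptic_map)  # (B, N, D) color-map prompt
        return patch_feats + crayon              # prompt injected additively

# Toy usage with random tensors (B=2 images, N=49 patches, D=768 dims, C=134 classes).
prompt = CrayonPrompt(num_classes=134, embed_dim=768)
feats = torch.randn(2, 49, 768)
pan = torch.randint(0, 134, (2, 49))
print(prompt(feats, pan).shape)  # torch.Size([2, 49, 768])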

🚀 Highlights

Figure. Zero-shot performance of CoLLaVO-7B on challenging VL datasets compared with closed-source VLMs: GPT-4V, Gemini-Pro, and Qwen-VL-Plus. Note: MME scores are rescaled by 1/20 to match the scale of the other benchmarks' accuracies.

Figure. Demonstrating the efficiency and effectiveness of CoLLaVO compared with other VLMs. Note that accuracy is measured on SEED-IMG.

Table. Four metrics (Accuracy, Precision, Recall, and F1-score) measured on the three POPE question types (Adversarial, Random, and Popular) to evaluate hallucination in vision language models.

📖 Citation

@article{lee2024collavo,
  title={CoLLaVO: Crayon Large Language and Vision mOdel},
  author={Lee, Byung-Kwan and Park, Beomchan and Kim, Chae Won and Ro, Yong Man},
  journal={arXiv preprint arXiv:2402.11248},
  year={2024}
}

Download CoLLaVO-7B (Under Preparation)

Model             GQA    SQA-IMG  TextVQA  POPE   MME-P    MME-C   MM-Bench  MMB-CN  MM-Vet  Q-Bench
BLIP2-13B         42.4   61.0     42.5     85.3   1293.8   290.0   -         -       22.4    -
InstructBLIP-7B   49.5   49.2     60.5     50.1   -        -       36.0      23.7    25.6    56.7
Qwen-VL-Chat-7B   57.5   68.2     61.5     -      1487.5   360.7   60.6      56.7    -       -
LLaVA1.5-7B       62.0   66.8     58.2     85.9   1510.7   293.8   64.3      58.3    30.5    58.7
CoLLaVO-7B        61.4   80.7     64.2     87.2   1689.7   525.0   83.0      82.1    40.3    67.6

📂 Directory Layout

.
├── asset                           # Required package lists (Important)
├── trainer                         # Training CoLLaVO and initializing the optimizer (Not Supported Yet)
├── utils                           # Miscellaneous util files (Not important)
├── collavo                         # CoLLaVO architecture & loading collavo (Important)
├── pipeline                        # Evaluating zero-shot vision language tasks (Important)
│
├── datasets                        # Important
│   ├── dataset_mappers             # Data parsing, including augmentation, for the loader
│   ├── evaluation                  # Evaluation measures for each dataset
│   └── registration                # Dataset registration
│
├── configs
│   ├── accel                       # Accelerate config files (DDP supported)
│   └── collavo_eval.yaml           # Config for evaluating collavo
│
├── modeling                        # Not Important
│   ├── architectures               # Training the prototype of collavo (Not Supported Yet)
│   ├── utils                       # Utils for modeling (Not important)
│   └── BaseModel                   # Loading and saving the model
│
├── lbk_entry.py                    # Main code of the control tower (Important)
├── run                             # Bash file for running the evaluation (Important)
│
├── install                         # Install required packages (Important)
└── README.md

💡 How to Run?

In the bash file 'install', you should first run the following lines.

conda create -n collavo python=3.9
conda activate collavo
conda clean -a && pip cache purge
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r assets/requirements/requirements.txt
pip install -r assets/requirements/requirements_custom.txt
pip install flash-attn --no-build-isolation
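
After installation, a quick sanity check (a generic snippet, not part of this repository) can confirm that the pinned PyTorch build sees your GPU before moving on:

import torch

print(torch.__version__)          # expect 2.0.1 from the pinned install
print(torch.version.cuda)         # expect 11.8
print(torch.cuda.is_available())  # should be True on a CUDA machine
print(torch.cuda.device_count())  # number of visible GPUs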

In addition, you should set the following environment variables to specify the dataset path.

export DETECTRON2_DATASETS=/path/to/dataset
export DATASET=/path/to/dataset
export DATASET2=/path/to/dataset
export VLDATASET=/path/to/dataset
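
You can verify that the variables resolve to an existing directory with a small check (again a generic snippet, not part of the repository):

import os

# Each variable should point to an existing dataset directory.
for var in ("DETECTRON2_DATASETS", "DATASET", "DATASET2", "VLDATASET"):
    path = os.environ.get(var)
    print(f"{var} = {path!r} (exists: {path is not None and os.path.isdir(path)})")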

Download the CoLLaVO-7B model, and then you can run demo.py:

"""
CoLLaVO-7B

Simple Six Steps
"""

# [1] Loading Image
from PIL import Image
from torchvision.transforms import Resize
from torchvision.transforms.functional import pil_to_tensor
image_path = "figures/crayon_image.jpg"
image = Resize(size=(490, 490), antialias=False)(pil_to_tensor(Image.open(image_path)))

# [2] Instruction Prompt
prompt = "Describe this image in detail"

# [3] Loading CoLLaVO
from collavo.load_collavo import prepare_collavo
collavo_model, collavo_processor, seg_model, seg_processor = prepare_collavo(collavo_path='BK-Lee/CoLLaVO-7B', bits=4, dtype='fp16')

# [4] Pre-processing for CoLLaVO
collavo_inputs = collavo_model.demo_process(image=image, 
                                    prompt=prompt, 
                                    processor=collavo_processor,
                                    seg_model=seg_model,
                                    seg_processor=seg_processor,
                                    device='cuda:0')

# [5] Generate
import torch
with torch.inference_mode():
    generate_ids = collavo_model.generate(**collavo_inputs, do_sample=True, temperature=0.9, top_p=0.95, max_new_tokens=256, use_cache=True)

# [6] Decoding (split('[U')[0] drops everything from the first '[U' onward,
# e.g. a trailing special token such as InternLM's '[UNUSED_TOKEN_...]')
answer = collavo_processor.batch_decode(generate_ids, skip_special_tokens=True)[0].split('[U')[0]
print(answer)
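
If you prefer deterministic, reproducible outputs, you can drop the sampling flags for greedy decoding. These are standard Hugging Face generate arguments; we assume collavo_model.generate forwards them unchanged.

# Greedy decoding variant (assumption: generate() accepts standard HF flags)
with torch.inference_mode():
    generate_ids = collavo_model.generate(**collavo_inputs, do_sample=False, max_new_tokens=256, use_cache=True)
answer = collavo_processor.batch_decode(generate_ids, skip_special_tokens=True)[0].split('[U')[0]
print(answer)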

If you want to validate zero-shot performance on numerous datasets, run the bash file 'run'.

# CoLLaVO-Experiment
GPU_DEVICE="0,1,2,3,4,5"
length=${#GPU_DEVICE}
n_gpu=$(((length+1)/2))
main_port=10000
test_batch=1

CUDA_VISIBLE_DEVICES=$GPU_DEVICE \
accelerate launch --config_file configs/accel/ddp_accel.yaml \
    --num_processes=$n_gpu \
    --main_process_port=$main_port \
    lbk_entry.py eval \
    --conf_files configs/collavo_eval.yaml \
    --overrides \
    WANDB False \
    DATASETS.TEST mme \
    PIPELINE MMEPipeline \
    MME.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SCIENCEQA.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    POPE.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MMBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MMVET.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    AI2D.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    HALLUSIONBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MATHVISTA.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    QBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SEED.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SAVE_DIR /path/to/CoLLaVO_DIR \
    WEIGHT True \
    RESUME_FROM /path/to/CoLLaVO_WEIGHT

Note that you should change the following two parts, DATASETS.TEST and PIPELINE, to evaluate the dataset you want (this is very important!). Their pairing is summarized in the sketch after the two lists below.

DATASETS.TEST

  • GQA: gqa_testdev_balanced
  • SQA-IMG: scienceqa_test
  • TextVQA: textvqa_val
  • POPE: pope_test
  • MME: mme
  • MM-Bench: mmbench or mmbench_cn
  • MM-Vet: mm-vet
  • Q-Bench: qbench_dev
  • MATHVISTA: mathvista_testmini
  • AI2D: ai2d
  • SEED-IMG: seed
  • HallusionBench: hallusionbench

PIPELINE

  • GQA: GQAPipeline
  • SQA-IMG: SQAPipeline
  • TextVQA: TextVQAPipeline
  • POPE: POPEPipeline
  • MME: MMEPipeline
  • MM-Bench: MMBenchPipeline
  • MM-Vet: MMVetPipeline
  • Q-Bench: QBenchPipeline
  • MATHVISTA: MathVistaPipeline
  • AI2D: AI2DPipeline
  • SEED-IMG: SEEDPipeline
  • HallusionBench: HallusionPipeline
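
Since DATASETS.TEST and PIPELINE must always be changed as a pair, a small Python mapping (illustrative only; the values are copied from the two lists above) can help generate the overrides:

# Benchmark -> (DATASETS.TEST, PIPELINE), copied from the lists above.
EVAL_CONFIGS = {
    "GQA":            ("gqa_testdev_balanced", "GQAPipeline"),
    "SQA-IMG":        ("scienceqa_test",       "SQAPipeline"),
    "TextVQA":        ("textvqa_val",          "TextVQAPipeline"),
    "POPE":           ("pope_test",            "POPEPipeline"),
    "MME":            ("mme",                  "MMEPipeline"),
    "MM-Bench":       ("mmbench",              "MMBenchPipeline"),  # or "mmbench_cn"
    "MM-Vet":         ("mm-vet",               "MMVetPipeline"),
    "Q-Bench":        ("qbench_dev",           "QBenchPipeline"),
    "MathVista":      ("mathvista_testmini",   "MathVistaPipeline"),
    "AI2D":           ("ai2d",                 "AI2DPipeline"),
    "SEED-IMG":       ("seed",                 "SEEDPipeline"),
    "HallusionBench": ("hallusionbench",       "HallusionPipeline"),
}

dataset, pipeline = EVAL_CONFIGS["MME"]
print(f"--overrides ... DATASETS.TEST {dataset} PIPELINE {pipeline} ...")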

GPT-4-Aided Evaluation for AI2D, MM-Vet, SEED-IMG

This code will be public soon!

๐Ÿ… Download Datasets

📂 Dataset Directory (/path/to/dataset)

.
├── GQA                             # GQA
├── ScienceQA                       # SQA-IMG
├── TextVQA                         # TextVQA
├── POPE                            # POPE
├── MME_Benchmark_release_version   # MME
├── MMBench                         # MM-Bench
├── mm-vet                          # MM-Vet
├── LLVisionQA-QBench               # Q-Bench
├── MathVista                       # MathVista
├── SEED-Bench                      # SEED-IMG
├── ai2d                            # AI2D
└── HallusionBench                  # HallusionBench
