This project forked from jshilong/gpt4roi

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

License: Other

GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest 🔥 Demo 🔥


Single-Region Understanding


Multiple-Region Understanding

Introduction


GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest
Shilong Zhang*, Peize Sun*, Shoufa Chen*, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, Ping Luo
(*Equal Contribution)

Updates

  • [July 7] All training and inference code has been released. You can try the demo here 🔥🔥🔥

Install

  1. Clone the GPT4RoI repository
git clone https://github.com/jshilong/gpt4roi.git
cd gpt4roi
  2. Create the environment
conda create -n gpt4roi python=3.10 -y
conda activate gpt4roi
pip install --upgrade pip  # enable PEP 660 support
pip install setuptools_scm
pip install --no-cache-dir -e .
# Reinstall PyTorch with conda; the pip build may lack some runtime libraries
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
  3. Install the flash-attn package
pip install ninja
pip install flash-attn --no-build-isolation
  4. Install the mmcv-1.4.7 package. Make sure the CUDA version reported by nvcc -V matches the cudatoolkit version reported by python -c "import torch;print(torch.version.cuda)".
cd mmcv-1.4.7
MMCV_WITH_OPS=1 pip install -e .
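
The version check mentioned in the mmcv step can be automated. The following is a minimal sketch, not part of the GPT4RoI repository: the helper names (parse_nvcc_release, cuda_versions_match, check_environment) are hypothetical, and it assumes nvcc prints a line containing "release X.Y", as current CUDA toolkits do.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(nvcc_output):
    """Extract the 'major.minor' CUDA release from `nvcc -V` output, or None."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def cuda_versions_match(nvcc_release, torch_cuda):
    """Compare major.minor of both version strings; False if either is missing."""
    if not nvcc_release or not torch_cuda:
        return False
    return nvcc_release.split(".")[:2] == torch_cuda.split(".")[:2]


def check_environment():
    """Run the comparison on this machine (needs nvcc on PATH and torch installed)."""
    if shutil.which("nvcc") is None:
        return "nvcc not found on PATH"
    output = subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout
    try:
        import torch  # imported lazily so the helpers above work without torch
    except ImportError:
        return "torch is not installed"
    nvcc_release = parse_nvcc_release(output)
    ok = cuda_versions_match(nvcc_release, torch.version.cuda)
    return f"nvcc={nvcc_release} torch={torch.version.cuda} match={ok}"


if __name__ == "__main__":
    print(check_environment())
```

Run the script inside the gpt4roi conda environment before building mmcv. If it reports a mismatch, building the mmcv ops is likely to fail or to produce extensions that do not load.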

Data

Our dataset includes RefCOCO, RefCOCO+, RefCOCOg, Visual Genome, Flickr30K Entities, and VCR. We are sincerely grateful to the creators of these datasets, especially the creators of VCR, for their forward thinking in building them.

The dataset section of this repository may appear somewhat messy, especially the VCR part (still being finished), which may make GPT4RoI less user-friendly. We are currently working on converting the datasets into a unified format and will release them together with stronger models. Please stay tuned for updates.

You can download each dataset from its official website and organize the files as follows. Afterwards, you can modify the gpt4roi/configs/dataset_config.json file to select the specific datasets you want to use:

GPT4RoI
├── data
│   ├── coco_det
│   │   ├── annotations
│   │   │      ├──instances_train2017.json
│   │   ├── train2017/
│   ├── mdetr_annotations
│   │          ├──finetune_refcoco_train.json
│   │          ├──finetune_refcoco+_train.json
│   │          ├──finetune_refcocog_train.json
│   │          ├──final_flickr_mergedGT_train.json
│   ├── coco_imgs/
│   ├── flickr30k-images/
│   ├── visual_genome
│   │          ├──train.json
│   │          ├──vg_all/
│   ├── llava
│   │   ├── llava_instruct_150k.json
│   │   ├── llava_150k_bbox_pred_results.pkl
│   ├── vcr
│   │   ├── train.jsonl
│   │   ├── vcr1images/

NOTE

  1. coco_imgs should contain all COCO images (you can soft-link them into this directory).
  2. We use Visual_Genome_Dataset_V1.2; you should soft-link all VG images into vg_all.
  3. llava_150k_bbox_pred_results.pkl contains the detection results predicted with EVA-02-DET. We appreciate their work.
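
Before training, it can help to confirm that the layout above is complete. The following is a minimal sketch, not part of the repository: EXPECTED_PATHS is copied from the directory tree in this section, and missing_paths is a hypothetical helper name.

```python
import os

# Relative paths copied from the directory tree in this section.
EXPECTED_PATHS = [
    "data/coco_det/annotations/instances_train2017.json",
    "data/coco_det/train2017",
    "data/mdetr_annotations/finetune_refcoco_train.json",
    "data/mdetr_annotations/finetune_refcoco+_train.json",
    "data/mdetr_annotations/finetune_refcocog_train.json",
    "data/mdetr_annotations/final_flickr_mergedGT_train.json",
    "data/coco_imgs",
    "data/flickr30k-images",
    "data/visual_genome/train.json",
    "data/visual_genome/vg_all",
    "data/llava/llava_instruct_150k.json",
    "data/llava/llava_150k_bbox_pred_results.pkl",
    "data/vcr/train.jsonl",
    "data/vcr/vcr1images",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected entries that do not exist under `root`.

    os.path.exists follows symbolic links, so soft-linked image
    folders (coco_imgs, vg_all) count as present.
    """
    return [p for p in expected if not os.path.exists(os.path.join(root, p))]


if __name__ == "__main__":
    # Run from the repository root; prints one line per missing entry.
    for path in missing_paths("."):
        print("missing:", path)
```

The check only verifies that each path exists. It does not validate file contents, and it does not know which datasets are enabled in gpt4roi/configs/dataset_config.json, so entries for datasets you do not plan to use can be ignored.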

Weights

Coming soon.

We release the GPT4RoI weights (coming soon) as delta weights to comply with the LLaMA model license. You can add our delta to the original LLaMA weights to obtain the GPT4RoI weights.

Instructions:

  1. Get the original LLaMA weights in the Hugging Face format by following the instructions here.
  2. Use the following script to obtain the GPT4RoI weights by applying our delta (coming soon).

GPT4RoI-7B

This conversion command needs around 30 GB of CPU RAM.

python3 -m llava.model.apply_delta \
    --base /path/to/llama-7b \
    --target /output/path/GPT4RoI-7B-v0 \
    --delta jshilong/GPT4RoI-7B-v0
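
Conceptually, applying a delta means adding the released difference to the base weights, parameter by parameter. The following is a simplified illustration of that idea only. It is not the code of llava.model.apply_delta: plain Python lists stand in for tensors, and the sketch assumes both checkpoints contain exactly the same parameter names and sizes, which a real conversion script may not require.

```python
def apply_delta(base, delta):
    """Return target weights with target[name][i] = base[name][i] + delta[name][i].

    `base` and `delta` map parameter names to flat lists of floats;
    the lists stand in for real tensors purely for illustration.
    """
    if base.keys() != delta.keys():
        raise ValueError("base and delta must contain the same parameter names")
    target = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        if len(base_values) != len(delta_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        target[name] = [b + d for b, d in zip(base_values, delta_values)]
    return target


# Toy example with a single two-element parameter.
base = {"layer.weight": [1.0, 2.0]}
delta = {"layer.weight": [0.5, -1.0]}
print(apply_delta(base, delta))  # {'layer.weight': [1.5, 1.0]}
```

Because only the difference is distributed, the delta is of no use without the original LLaMA weights, which is how this release format complies with the LLaMA license.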

Training

GPT4RoI is trained on 8 A100 GPUs with the following scripts.

STAGE 1

Modify gpt4roi/configs/dataset_config.json so that only the stage 1 datasets are used.

bash train_stage1.sh

STAGE 2

Modify gpt4roi/configs/dataset_config.json so that only the stage 2 datasets are used.

bash train_stage2.sh

Gradio

Please install Gradio Box first.

python gpt4roi/app.py

NOTES

  1. Prompt format in GPT4RoI: always use <region1>, <region2>, ... to refer to new bounding boxes in the image when you first draw them. Afterwards, you can use plain "region 1" in the conversation to refer to that instance.
  2. Always click the "clear all" button and wait for the clearing process to finish before you start a new conversation.
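
The prompt convention in note 1 can be checked before a prompt is sent. The following is a minimal sketch, not part of the demo code: the helper names are hypothetical, and it assumes regions are numbered from 1 in the order the boxes were drawn, as the <region1>, <region2> examples suggest.

```python
import re

# Matches tags such as <region1> or <region12> and captures the number.
REGION_TAG = re.compile(r"<region(\d+)>")


def region_indices(prompt):
    """Return the region numbers referenced as <regionN> tags, in order of appearance."""
    return [int(n) for n in REGION_TAG.findall(prompt)]


def undefined_regions(prompt, num_boxes):
    """Return referenced region numbers that have no drawn box (valid: 1..num_boxes)."""
    return sorted({i for i in region_indices(prompt) if not 1 <= i <= num_boxes})


prompt = "What is <region1> doing, and how is it related to <region2>?"
print(region_indices(prompt))        # [1, 2]
print(undefined_regions(prompt, 1))  # [2]
```

An empty result from undefined_regions means every tag in the prompt refers to a box that has been drawn.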


Multiple Rounds of Dialogue

Acknowledgement

  • LLaVA: The codebase we built upon.
  • Vicuna: The LLM we used.
  • VCR: We get strong region reasoning ability from this forward-thinking dataset.

If you find GPT4RoI useful for your research and applications, please cite it using this BibTeX:

@misc{zhang2023gpt4roi,
      title={GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest}, 
      author={Shilong Zhang and Peize Sun and Shoufa Chen and Min Xiao and Wenqi Shao and Wenwei Zhang and Kai Chen and Ping Luo},
      year={2023},
      eprint={2307.03601},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
