
haiyang-w / git


Official Implementation of "GiT: Towards Generalist Vision Transformer through Universal Language Interface"

Home Page: https://arxiv.org/abs/2403.09394

License: Apache License 2.0

Languages: Python 99.03%, Shell 0.88%, Dockerfile 0.09%
Topics: foundation-models, perception, transformer, unified, vision-and-language, vision-transformer

GiT's Introduction

💥 GiT: the first successful LLM-like general vision model, unifying various vision tasks with only a vanilla ViT


This repo is the official implementation of the paper GiT: Towards Generalist Vision Transformer through Universal Language Interface, as well as its follow-ups. We have made every effort to ensure that the codebase is clean, concise, easily readable, state-of-the-art, and reliant only on minimal dependencies.

GiT: Towards Generalist Vision Transformer through Universal Language Interface

Haiyang Wang*, Hao Tang*, Li Jiang†, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, Liwei Wang†

Overview

💫 What we want to do

Reducing Human Bias in Model Architecture Design

We aim to unify the model architecture of vision and language through a plain transformer, reducing human biases such as modality-specific encoders and task-specific heads. A key advancement in deep learning is the shift from hand-crafted to autonomously learned features, inspiring us to reduce human-designed aspects in architecture. Moreover, benefiting from the flexibility of plain transformers, our framework can extend to more modalities like point clouds and graphs.
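Below is a minimal conceptual sketch (in PyTorch) of this idea: image patches and text tokens are embedded into one sequence and processed by a single plain transformer with a language-style output head. The module names, sizes, and patchify step are illustrative assumptions, not GiT's actual implementation (positional embeddings and task prompts are omitted for brevity).

# Conceptual sketch only: one plain transformer over a mixed image/text token sequence.
# Sizes and names are illustrative and do NOT reflect the actual GiT code.
import torch
import torch.nn as nn

class PlainMultimodalTransformer(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, depth=12, heads=12, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # image -> patch tokens
        self.text_embed = nn.Embedding(vocab_size, dim)                        # text ids -> tokens
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)                      # the single shared transformer
        self.lm_head = nn.Linear(dim, vocab_size)                              # outputs expressed as language tokens

    def forward(self, image, text_ids):
        img_tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N_img, C)
        txt_tokens = self.text_embed(text_ids)                           # (B, N_txt, C)
        x = torch.cat([img_tokens, txt_tokens], dim=1)                   # one mixed sequence, no modality-specific branch
        x = self.blocks(x)
        return self.lm_head(x[:, img_tokens.size(1):])                   # logits over the text positions

# Example: one 224x224 image and a 16-token prompt.
logits = PlainMultimodalTransformer()(torch.randn(1, 3, 224, 224), torch.randint(0, 30522, (1, 16)))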

🤔 Introduction

Building a universal computation model across all tasks stands as the cornerstone of artificial intelligence, reducing the need for task-specific designs. In this project, we introduce GiT (Generalist Vision Transformer). GiT has the following characteristics:

  • 😮 Minimalist architecture design similar to LLMs: GiT consists solely of a single transformer, without additional vision encoders or adapters.
  • 🚀 Covering all types of visual understanding tasks: GiT addresses a spectrum of visual tasks, including object-level tasks (e.g., object detection), pixel-level tasks (e.g., semantic segmentation), and vision-language tasks (e.g., image captioning).
  • 🤗 Achieving multi-task ability through a unified language interface: like an LLM, GiT exhibits a task synergy effect in multi-task training, fostering mutual enhancement across tasks and yielding significant improvements over isolated training (a toy sketch of how task outputs can be phrased as tokens follows this list).
  • 🔥 Strong performance on zero-shot and few-shot benchmarks: GiT scales well with model size and data, demonstrating remarkable generalizability across diverse scenarios after training on 27 datasets.
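As a toy illustration of the universal language interface, the sketch below serializes a detection target into a flat token sequence by discretizing box coordinates into bins followed by the category name. The bin vocabulary and output format here are illustrative assumptions and do not reproduce GiT's exact serialization.

# Conceptual sketch only: express a detection target as a flat token sequence.
def box_to_tokens(box, category, image_size, num_bins=448):
    """Discretize a pixel box (x1, y1, x2, y2) into integer coordinate bins,
    followed by the category name, so the output is 'just tokens'."""
    w, h = image_size
    x1, y1, x2, y2 = box
    coords = [x1 / w, y1 / h, x2 / w, y2 / h]
    bins = [min(int(c * num_bins), num_bins - 1) for c in coords]
    return [f"<bin_{b}>" for b in bins] + [category]

print(box_to_tokens((48, 32, 320, 240), "person", image_size=(640, 480)))
# ['<bin_33>', '<bin_29>', '<bin_224>', '<bin_224>', 'person']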

📣 News

  • [24-3-15] 🚀 Training and inference code is released.
  • [24-3-15] 👀 GiT is released on arXiv.

👀 Todo

  • Release the arXiv version.
  • SOTA performance among generalist models on the multi-tasking benchmark.
  • SOTA performance among generalist models on zero- and few-shot benchmarks.
  • Clean up and release the inference code.
  • Clean up and release the training code.
  • Engineering Optimization (faster).
  • Joint Training including Language (stronger).
  • Code refactoring (the current code is still a bit messy, sorry for that).

🚀 Main Results

Single-Task Benchmark

| Model | Params | Metric | Performance | ckpt | log | config |
|-------|--------|--------|-------------|------|-----|--------|
| GiT-B (detection) | 131M | mAP | 45.1 | ckpt | log | config |
| GiT-B (ins seg) | 131M | mAP | 31.4 | ckpt | log | config |
| GiT-B (sem seg) | 131M | mIoU | 47.7 | ckpt | log | config |
| GiT-B (caption) | 131M | BLEU-4 | 33.7 | ckpt | log | config |
| GiT-B (grounding) | 131M | Acc@0.5 | 83.3 | ckpt | log | config |

Multi-Tasking Benchmark

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding | ckpt | log | config |
|-------|--------|-----------|---------|---------|---------|-----------|------|-----|--------|
| GiT-B (multi-task) | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 | ckpt | log | config |
| GiT-L (multi-task) | 387M | 51.3 | 35.1 | 50.6 | 35.7 | 88.4 | ckpt | log | config |
| GiT-H (multi-task) | 756M | 52.9 | 35.8 | 52.4 | 36.2 | 89.2 | ckpt | log | config |

Task Synergy in Multi-Tasking Training

| Model | Params | Detection | Ins Seg | Sem Seg | Caption | Grounding |
|-------|--------|-----------|---------|---------|---------|-----------|
| GiT-B (single-task) | 131M | 45.1 | 31.4 | 47.7 | 33.7 | 83.3 |
| Improvement | | +1.6 | +0.5 | +0.1 | +1.6 | +2.5 |
| GiT-B (multi-task) | 131M | 46.7 | 31.9 | 47.8 | 35.3 | 85.8 |

Zero-shot benchmark

| Model | Params | Cityscapes (Det) | Cityscapes (Ins Seg) | Cityscapes (Sem Seg) | SUN RGB-D | nocaps | ckpt | log | config |
|-------|--------|------------------|----------------------|----------------------|-----------|--------|------|-----|--------|
| GiT-B (multi-task) | 131M | 21.8 | 14.3 | 34.4 | 30.9 | 9.2 | ckpt | log | config |
| GiT-B (universal) | 131M | 29.1 | 17.9 | 56.2 | 37.5 | 10.6 | ckpt | log | config |
| GiT-L (universal) | 387M | 32.3 | 20.3 | 58.0 | 39.9 | 11.6 | ckpt | log | config |
| GiT-H (universal) | 756M | 34.1 | 18.7 | 61.8 | 42.5 | 12.6 | ckpt | log | config |

Few-shot benchmark

| Model | Params | DRIVE | LoveDA | Potsdam | WIDERFace | DeepFashion | config |
|-------|--------|-------|--------|---------|-----------|-------------|--------|
| GiT-B (multi-task) | 131M | 34.3 | 24.9 | 19.1 | 17.4 | 23.0 | config |
| GiT-B (universal) | 131M | 51.1 | 30.8 | 30.6 | 31.2 | 38.3 | config |
| GiT-L (universal) | 387M | 55.4 | 34.1 | 37.2 | 33.4 | 49.3 | config |
| GiT-H (universal) | 756M | 57.9 | 35.1 | 43.4 | 34.0 | 52.2 | config |

🛠️ Quick Start

Installation

conda create -n GiT python=3.8

conda activate GiT

# We only tested with torch 1.9.1; other versions may also work.
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html

pip install -U openmim
mim install "mmengine==0.8.3"
mim install "mmcv==2.0.1"
pip install "transformers==4.31.0"

git clone git@github.com:Haiyang-W/GiT.git
cd GiT
pip install -v -e .
pip install -r requirements/optional.txt
pip install -r requirements/runtime.txt

# if you face ChildFailedError, please update yapf
pip install yapf==0.40.1
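After installation, you can optionally run a quick sanity check inside the GiT environment to confirm that the pinned versions above are picked up and that CUDA is visible; this snippet only prints versions.

# optional environment sanity check (run in a Python shell inside the GiT env)
import torch, torchvision, mmengine, mmcv, transformers

print("torch       :", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision :", torchvision.__version__)
print("mmengine    :", mmengine.__version__)
print("mmcv        :", mmcv.__version__)
print("transformers:", transformers.__version__)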
  • Please download the pretrained text embeddings from Hugging Face and organize the downloaded files as follows (a quick load check is sketched after the optional installs below):
GiT
|──bert_embed.pt
|──bert_embed_large.pt
|──bert_embed_huge.pt
  • (Optional) Install Java manually for image caption evaluation. Without Java, caption training works normally, but caption evaluation will fail.
  • (Optional) Install the LVIS API for the LVIS dataset.
# current path is ./GiT
cd ..
pip install git+https://github.com/lvis-dataset/lvis-api.git
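Optionally, you can verify that the downloaded text-embedding files load correctly. The snippet below assumes they are standard torch-serialized objects; adjust the paths if you placed them elsewhere.

# optional: verify the downloaded text-embedding files (assumed to be torch-serialized)
import torch

for name in ["bert_embed.pt", "bert_embed_large.pt", "bert_embed_huge.pt"]:
    obj = torch.load(name, map_location="cpu")
    desc = tuple(obj.shape) if torch.is_tensor(obj) else type(obj).__name__
    print(name, "->", desc)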

Dataset Preparation

Multi-Tasking Dataset

The multi-tasking benchmark contains COCO 2017 for object detection and instance segmentation, ADE20K for semantic segmentation, COCO Caption for image captioning, and the RefCOCO series for visual grounding. Organize the data as shown below (a quick layout check is sketched after the tree).

GiT
|──data
|  |──ade
|  |  |──ADEChallengeData2016
|  |  |  |──annotations
|  |  |  |  |──training & validation
|  |  |  |──images
|  |  |  |  |──training & validation
|  |  |  |──objectInfo150.txt
|  |  |  |──sceneCategories.txt
|  |──coco
|  |  |──annotations
|  |  |  |──*.json
|  |  |──train2017
|  |  |  |──*.jpg
|  |  |──val2017
|  |  |  |──*.jpg
|  |──coco_2014
|  |  |──annotations
|  |  |  |──*.json
|  |  |  |──coco_karpathy_test.json
|  |  |  |──coco_karpathy_train.json
|  |  |  |──coco_karpathy_val_gt.json
|  |  |  |──coco_karpathy_val.json
|  |  |──train2014
|  |  |  |──*.jpg
|  |  |──val2014
|  |  |  |──*.jpg
|  |  |──refcoco
|  |  |  |──*.p

Universal Dataset

We use 27 datasets in universal training. For more details about dataset preparation, please refer here.


🚨 We only list some of the commands (for GiT-B) below. For more detailed commands, please refer here.

Training

Single Task

Detection

bash tools/dist_train.sh configs/GiT/single_detection_base.py  ${GPU_NUM} --work-dir ${work_dir}

Multi Task

GiT-B

bash tools/dist_train.sh configs/GiT/multi_fivetask_base.py  ${GPU_NUM} --work-dir ${work_dir}

Universal Training

GiT-B

bash tools/dist_train.sh configs/GiT/universal_base.py  ${GPU_NUM} --work-dir ${work_dir}

Testing

Single Task

Detection

bash tools/dist_test.sh configs/GiT/single_detection_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}

Multi Task

GiT-B

bash tools/dist_test.sh configs/GiT/multi_fivetask_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}

Zero-shot and few-shot

Please download the universal pretraining weights from Hugging Face and organize the files as follows (a quick checkpoint inspection is sketched after the file list):

GiT
|──universal_base.pth
|──universal_large.pth
|──universal_huge.pth
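Before running evaluation, you can optionally inspect a downloaded checkpoint. The snippet assumes an MMEngine-style checkpoint that stores weights under a 'state_dict' key; that layout is an assumption, not something verified here.

# optional: inspect a downloaded checkpoint (assumes an MMEngine-style 'state_dict' layout)
import torch

ckpt = torch.load("universal_base.pth", map_location="cpu")
print("top-level keys:", list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt).__name__)
if isinstance(ckpt, dict) and "state_dict" in ckpt:
    n_params = sum(v.numel() for v in ckpt["state_dict"].values())
    print(f"parameters: {n_params / 1e6:.1f}M")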

Zero-shot

bash tools/dist_test.sh configs/GiT/zero-shot/zero_shot_cityscapes_det_base.py ${ckpt_file} ${GPU_NUM} --work-dir ${work_dir}

Few-shot

bash tools/dist_train.sh configs/GiT/few-shot/few_shot_drive_det_base.py ${GPU_NUM} --work-dir ${work_dir}

Customize Dataset

If you want to use GiT on your own dataset, please refer here for more details.
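For reference, a common pattern in MMDetection-based codebases is to register a COCO-format dataset as sketched below; the class name, categories, and paths are hypothetical, and the linked guide covers the GiT-specific registration and config fields you still need to set.

# hypothetical example of registering a custom COCO-format dataset (MMDetection 3.x pattern);
# see the linked customization guide for the GiT-specific steps and config fields
from mmdet.datasets import CocoDataset
from mmdet.registry import DATASETS

@DATASETS.register_module()
class MyShopSignsDataset(CocoDataset):  # hypothetical class name
    METAINFO = {
        'classes': ('shop_sign', 'billboard'),      # your category names
        'palette': [(220, 20, 60), (119, 11, 32)],  # display colors, one per class
    }

In MMDetection-style configs, the dataset config would then typically set type='MyShopSignsDataset' and point ann_file/data_prefix at your annotations and images.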

🚀 Lightweight Version

If your GPU memory is insufficient, you can reduce the input resolution as shown here, where we lower the detection resolution to 672. This variant requires ~20 hours of training and reaches ~41.5 mAP.

👍 Acknowledgement

  • MMDetection: the codebase we build upon. Thanks for providing such a convenient framework.
  • BLIP: we extract text embeddings from BLIP pretrained models and use the web captions filtered by BLIP. Thanks for their open-source efforts and dataset cleaning.

📘 Citation

Please consider citing our work as follows if it is helpful.

@article{wang2024git,
    title={GiT: Towards Generalist Vision Transformer through Universal Language Interface},
    author={Haiyang Wang and Hao Tang and Li Jiang and Shaoshuai Shi and Muhammad Ferjad Naeem and Hongsheng Li and Bernt Schiele and Liwei Wang},
    journal={arXiv preprint arXiv:2403.09394},
    year={2024}
}

✨ Star History

Star History Chart


GiT's Issues

Details regarding few-shot and zero shot datasets

Hi,

Thank you for the code and the README, which are both very well organized. I am trying to set up the few-shot and zero-shot datasets. Are there any details I need to take into account?

Thank you!.

LoveDA Few Shot training annotation format

Hi,

Thanks for making the code public. I'm trying to run few-shot learning on the LoveDA dataset. Once I start training, I get the following error.

FileNotFoundError: [Errno 2] No such file or directory: 'data/loveDA/ann_dir/train/1040.png'

My loveDA folder structure is as follows:

GiT
|──data
│   ├── loveDA
│   │   ├── img_dir
│   │   │   ├── train
│   │   │   ├── val
│   │   │   ├── test
│   │   ├── ann_dir
│   │   │   ├── train
│   │   │   |      |----- 1040.png.json
│   │   │   ├── val

as described here: https://github.com/Haiyang-W/GiT/blob/main/tools/dataset_preprocess/dataset_prepare.md

However, I see in the same link that I should run python tools/dataset_converters/loveda.py /path/to/loveDA, but I don't see that file in the cloned repo. If this script is needed to fix the issue, could you please provide the dataset converter file for the LoveDA dataset?

KeyError: 'Duplicate key is not allowed among bases'

When using the large and huge few-shot configurations for the LoveDA dataset, there are overlapping keys because the config inherits from both the LoveDA base config and the git_large config. The issue can be resolved by creating a copy of the LoveDA base config, replacing git_base with git_large, and changing load_from from the base to the large checkpoint.

Is there any other way to fix this? Or am I missing something here?

Thank you!

warning upon loading Tokenizer?

I've encountered the following warnings when trying fast mode.
I tried loading a BlipTokenizer, so why does this warning about the tokenizer class appear?

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BertTokenizer'.
The class this function is called from is 'BlipTokenizer'.

Besides, my demo runs in an offline environment where I cannot download pretrained weights.

Are there any suggestions? Many thanks!

AssertionError in runpy.py

I tried to use this awesome model as shown below,

(screenshot omitted)

but it's not working.

Could I ask what the problem is?

Thank you!
